Abstract

A new procedure for learning cost-sensitive SVM (CS-SVM) classifiers is proposed. The SVM hinge
loss is extended to the cost-sensitive setting, and the CS-SVM is derived as the minimizer of the
associated risk. The extension of the hinge loss draws on recent connections between risk
minimization and probability elicitation. These connections are generalized to cost-sensitive
classification, in a manner that guarantees consistency with the cost-sensitive Bayes risk and the
associated Bayes decision rule. This ensures that optimal decision rules, under the new hinge loss,
implement the Bayes-optimal cost-sensitive classification boundary. Minimization of the new hinge
loss is shown to be a generalization of the classic SVM optimization problem, and can be solved by
identical procedures. The dual problem of the CS-SVM is analyzed by means of regularization theory
and sensitivity analysis, which provides further justification for the CS-SVM algorithm. The
proposed algorithm is also extended to cost-sensitive learning with example-dependent costs. The
minimum cost-sensitive risk is proposed as the performance measure and is connected to ROC analysis
through vector optimization. The resulting algorithm avoids the shortcomings of previous approaches
to cost-sensitive SVM design, and is shown to have superior experimental performance on a large
number of cost-sensitive and imbalanced datasets.

Keywords: Cost Sensitive Learning, SVM, probability elicitation, Bayes consistent loss 



1. Introduction 

The most popular strategy for the design of classification algorithms is to minimize the probability 
of error, assuming that all misclassifications have the same cost. The resulting decision rules are 
usually denoted as cost-insensitive. However, in many important applications of machine learning, 
such as medical diagnosis, fraud detection, or business decision making, certain types of error are 
much more costly than others. Other applications involve significantly unbalanced datasets, where 
examples from different classes appear with substantially different probability. It is well known, 
from Bayesian decision theory, that in either of these two situations (uneven costs or probabilities),
the optimal decision rule deviates from the optimal cost-insensitive rule in the same manner. In both
cases, reliance on cost-insensitive algorithms for classifier design can be highly sub-optimal. While
this makes it obviously important to develop cost-sensitive extensions of state-of-the-art machine 
learning techniques, the current understanding of such extensions is limited. 

In this work we consider the support vector machine (SVM) architecture Cortes and Vapnik 
(1995). Although SVMs are based on a very solid learning-theoretic foundation, and have been 
successfully applied to many classification problems, it is not well understood how to design
cost-sensitive extensions of the SVM learning algorithm. The standard, or cost-insensitive, SVM is
based on the minimization of a symmetric loss function (the hinge loss) that does not have an 
obvious cost-sensitive generalization. In the literature, this problem has been addressed by various 
approaches, which can be grouped into three general categories. The first is to address the problem 
as one of data processing, by adopting resampling techniques that under-sample the majority class 
and/or over-sample the minority class Kubat and Matwin (1997); Chawla et al. (2002); Akbani et al. 
(2004); Geibel et al. (2004); Zadrozny et al. (2003). Resampling is not easy when the classification 
unbalance is due to either different misclassification costs (not clear what the class probabilities 
should be) or an extreme unbalance in class probabilities (sample starvation for classes of very low 
probability). It also does not guarantee that the learned SVM will change, since it could have no 
effect on the support vectors. Active learning based methods have also been proposed to train the 
SVM algorithm on the informative instances, instances which are close to the hyperplane Ertekin 
et al. (2007). 

The second class of approaches Amari and Wu (1999); Wu and Chang (2003, 2005) involves
kernel modifications. These methods are based on conformal transformations of the input or feature 
space, by modifying the kernel used by the SVM. They are somewhat unsatisfactory, due to the 
implicit assumption that a linear SVM cannot be made cost-sensitive. It is unclear why this should 
be the case. 

The third, and most widely researched, approach is to modify the SVM algorithm in order to 
achieve cost sensitivity. This is done in one of two ways. The first is a naive method, known as 
boundary movement (BM-SVM), which shifts the decision boundary by simply adjusting the threshold
of the standard SVM Karakoulas and Shawe-Taylor (1999). Under Bayesian decision theory,
this would be the optimal strategy if the class posterior probabilities were available. However, it is
well known that SVMs do not predict these probabilities accurately. While a literature has developed
in the area of probability calibration Platt (2000), calibration techniques do not aid the cost-sensitive
performance of threshold manipulation. This follows from the fact that all calibration techniques 
rely on an invertible (monotonic and one-to-one) transformation of the SVM output. Because the 
manipulation of a threshold at either the input or output of such a transformation produces the same 
receiver-operating-characteristic (ROC) curve, calibration does not change cost-sensitive
classification performance. The boundary movement method is also obviously flawed when the data is
non-separable, in which case cost-sensitive optimality is expected to require a modification of both 
the normal of the separating plane w and the classifier threshold b. The second proposal to modify 
SVM learning is known as the biased penalties (BP-SVM) method Bach et al. (2006); Lin et al. 
(2002); Davenport et al. (2006); Wu and Srihari (2003); Chang and Lin (2011). This consists of 
introducing different penalty factors $C_1$ and $C_{-1}$ for the positive and negative SVM slack variables
during training. It is implemented by transforming the primal SVM problem into

$$\arg\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \left[ C_1 \sum_{\{i|y_i=1\}} \xi_i + C_{-1} \sum_{\{i|y_i=-1\}} \xi_i \right]$$
$$\text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i. \tag{1}$$



The biased penalties method also suffers from an obvious flaw, which is converse to that of the 
boundary movement method: it has limited ability to enforce cost-sensitivity when the training data 
is separable. For large slack penalty $C$, the slack variables $\xi_i$ are zero-valued and the optimization
above degenerates into that of the standard SVM, where the decision boundary is placed midway
between the two classes rather than assigning a larger margin to one of them. 

In this work we propose an alternative strategy for the design of cost-sensitive SVMs. This 
strategy is fundamentally different from previous attempts, in the sense that it does not directly
manipulate the standard SVM learning algorithm. Instead, we extend the SVM hinge loss, and derive
the optimal cost-sensitive learning algorithm as the minimizer of the associated risk. The derivation 
of the new cost-sensitive hinge loss draws on recent connections between risk minimization and 
probability elicitation Masnadi-Shirazi and Vasconcelos (2009). Such connections are generalized 
to the case of cost-sensitive classification. 

It is shown that it is always possible to specify the predictor and conditional risk functions
desired for the SVM classifier, and derive the loss for which these are optimal. A sufficient condition
for the cost-sensitive Bayes-optimality of the predictor is then provided, as well as necessary
conditions for conditional risks that approximate the cost-sensitive Bayes risk. Together, these conditions
enable the design of a new hinge loss which is minimized by an SVM that 1) implements the
cost-sensitive Bayes decision rule, and 2) approximates the cost-sensitive Bayes risk. It is also shown
that the minimization of this loss is a generalization of the classic SVM optimization problem, and 
can be solved by identical procedures. The resulting algorithm avoids the shortcomings of previous 
methods, producing cost-sensitive decision rules for both cases of separable and inseparable training 
data. Experimental results show that these advantages result in better cost-sensitive classification 
performance than previous solutions. 

Since the CS-SVM is implemented in the dual, cost-sensitive learning in the dual deserves closer
study. We show that cost-sensitive learning appears in the dual as a regularization term together
with a change in the upper bounds of the constraints, and we interpret the latter through sensitivity
analysis. These connections are considered under both cost-sensitive learning and imbalanced data learning.

Moreover, we show that in the cost-sensitive and imbalanced data settings, the priors and costs 
should be incorporated in the performance measure. We propose minimum expected (cost-sensitive) 
risk as a cost sensitive performance metric and demonstrate its connections to the ROC curve. For 
the case of unknown costs, we introduce a robust measure which reflects the performance of the 
classifier under a given tolerance of false-positive or false-negative errors. 

The paper is organized as follows. Section 2 briefly reviews the probability elicitation view 
of loss function design Masnadi-Shirazi and Vasconcelos (2009). Section 3 then generalizes the 
connections between probability elicitation and risk minimization to the cost-sensitive setting. In 
Section 4, these connections are used to derive the new SVM loss and algorithm. In Section 5,
the dual problem of the CS-SVM is analyzed in terms of regularization and sensitivity
analysis. Section 6 presents an extension of the CS-SVM for problems with example-dependent
costs. Section 7 proposes the minimum cost-sensitive risk as a standard measure for examining
classifier performance in the cost-sensitive and imbalanced data setting. Finally, Section 8 presents
an experimental evaluation that demonstrates improved performance of the proposed cost sensitive 
SVM over previous methods. 

2. Bayes consistent classifier design 

The goal of classification is to map feature vectors $x \in \mathcal{X}$ to class labels $y \in \{-1, 1\}$. From a
statistical viewpoint, the feature vectors and class labels are drawn from probability distributions
$P_X(x)$ and $P_Y(y)$ respectively. In terms of functions, we write a classifier as $h(x) = \mathrm{sign}[p(x)]$,
where the function $p: \mathcal{X} \to \mathbb{R}$ is denoted as the classifier predictor. Given a non-negative function
$L(p(x), y)$ that assigns a loss to each $(p(x), y)$ pair, the classifier is considered optimal if it
minimizes the expected loss $R = E_{X,Y}[L(p(x), y)]$, also known as the risk. Minimizing the risk is
itself equivalent to minimizing the conditional risk

$$E_{Y|X}[L(p(x), y) \mid X = x] = P_{Y|X}(1|x) L(p(x), 1) + (1 - P_{Y|X}(1|x)) L(p(x), -1), \tag{2}$$

for all $x \in \mathcal{X}$. It is instructive to write the predictor function $p(x)$ as a composition of two functions
$p(x) = f(\eta(x))$, where $\eta(x) = P_{Y|X}(1|x)$ is the posterior probability, and $f: [0, 1] \to \mathbb{R}$ is
denoted as the link function. This provides a valuable connection to the Bayes decision rule (BDR). A loss
is considered Bayes consistent when its associated risk is minimized by the BDR. For example, the
zero-one loss can be written as

$$L_{0/1}(f, y) = \frac{1 - \mathrm{sign}(yf)}{2} = \begin{cases} 0, & \text{if } y = \mathrm{sign}(f); \\ 1, & \text{if } y \ne \mathrm{sign}(f), \end{cases} \tag{3}$$

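where we omit the dependence on $x$ for notational simplicity. The conditional risk for this loss
function is

$$C_{0/1}(\eta, f) = \eta \frac{1 - \mathrm{sign}(f)}{2} + (1 - \eta) \frac{1 + \mathrm{sign}(f)}{2} = \begin{cases} 1 - \eta, & \text{if } f \ge 0; \\ \eta, & \text{if } f < 0. \end{cases} \tag{4}$$

This risk is minimized by any predictor $f^*$ such that

$$\begin{cases} f^*(x) > 0 & \text{if } \eta(x) > \gamma \\ f^*(x) = 0 & \text{if } \eta(x) = \gamma \\ f^*(x) < 0 & \text{if } \eta(x) < \gamma \end{cases} \tag{5}$$

and $\gamma = \frac{1}{2}$. Examples of optimal predictors include $f^* = 2\eta - 1$ and $f^* = \log\frac{\eta}{1 - \eta}$. The associated
optimal classifier $h^* = \mathrm{sign}[f^*]$ is the well known Bayes decision rule, thus proving that the
zero-one loss is Bayes consistent. Finally, the associated minimum conditional (zero-one) risk is

$$C^*_{0/1}(\eta) = \eta \left( \frac{1}{2} - \frac{1}{2}\mathrm{sign}(2\eta - 1) \right) + (1 - \eta) \left( \frac{1}{2} + \frac{1}{2}\mathrm{sign}(2\eta - 1) \right). \tag{6}$$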

A handful of other losses have been shown to be Bayes consistent. These include the exponential 
loss used in boosting classifiers Friedman et al. (2000), logistic loss of logistic regression Friedman 
et al. (2000); Zhang (2004), or the hinge loss of SVMs Zhang (2004). These losses are of the 
form $L_\phi(f, y) = \phi(yf)$ for different functions $\phi(\cdot)$ and are known as margin losses. Margin losses
assign a non-zero penalty to small positive $yf$, encouraging the creation of a margin. The resulting
large-margin classifiers have better generalization than those produced by the zero-one loss or other
losses that do not enforce a margin Vapnik (1998). For a margin loss, the conditional risk is simply

$$C_\phi(\eta, f) = \eta \phi(f) + (1 - \eta) \phi(-f). \tag{7}$$

The conditional risk is minimized by the predictor

$$f^*_\phi(\eta) = \arg\min_f C_\phi(\eta, f) \tag{8}$$

and the minimum conditional risk is $C^*_\phi(\eta) = C_\phi(\eta, f^*_\phi)$.
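As a minimal numerical sketch of (7) and (8), the conditional risk of the exponential margin loss can be minimized by brute force and compared with the known closed form $f^* = \frac{1}{2}\log\frac{\eta}{1-\eta}$. The function names, the search interval and the grid resolution below are illustrative assumptions, not part of the original derivation.

```python
import math

def conditional_risk(eta, f, phi):
    """Conditional risk of a margin loss, eq. (7): eta*phi(f) + (1-eta)*phi(-f)."""
    return eta * phi(f) + (1.0 - eta) * phi(-f)

def grid_argmin(eta, phi, lo=-5.0, hi=5.0, steps=10001):
    """Approximate the minimizer of eq. (8) by exhaustive search on a uniform grid."""
    best_f, best_risk = lo, float("inf")
    for k in range(steps):
        f = lo + (hi - lo) * k / (steps - 1)
        risk = conditional_risk(eta, f, phi)
        if risk < best_risk:
            best_f, best_risk = f, risk
    return best_f, best_risk

def exp_loss(v):
    """Exponential margin loss used in boosting."""
    return math.exp(-v)

eta = 0.8
f_star, c_star = grid_argmin(eta, exp_loss)
closed_form = 0.5 * math.log(eta / (1.0 - eta))
print(f_star, closed_form, c_star)
```

Since $\eta > \frac{1}{2}$ in this example, the grid minimizer is positive, in agreement with the Bayes decision rule of (5).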

Recently, a generative formula for the derivation of novel Bayes consistent loss functions has 
been presented in Masnadi-Shirazi and Vasconcelos (2009), relying on classical probability elicitation
in statistics Savage (1971). Comparable to risk minimization, in probability elicitation, the goal
is to find the probability estimator $\hat{\eta}$ that maximizes the expected reward

$$I(\eta, \hat{\eta}) = \eta I_1(\hat{\eta}) + (1 - \eta) I_{-1}(\hat{\eta}), \tag{9}$$

where $I_1(\hat{\eta})$ is the reward for predicting $\hat{\eta}$ when event $y = 1$ holds and $I_{-1}(\hat{\eta})$ the corresponding
reward when $y = -1$. The functions $I_1(\cdot), I_{-1}(\cdot)$ are such that the expected reward is maximal
when $\hat{\eta} = \eta$, i.e.

$$I(\eta, \hat{\eta}) \le I(\eta, \eta) = J(\eta), \quad \forall \eta \tag{10}$$

with equality if and only if $\hat{\eta} = \eta$.

Theorem 1 Savage (1971) Let $I(\eta, \hat{\eta})$ and $J(\eta)$ be as defined in (9) and (10). Then 1) $J(\eta)$ is
convex and 2) (10) holds if and only if

$$I_1(\eta) = J(\eta) + (1 - \eta) J'(\eta) \tag{11}$$
$$I_{-1}(\eta) = J(\eta) - \eta J'(\eta). \tag{12}$$
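As a hedged illustration of Theorem 1, the sketch below picks one convex $J$ (the negative binary entropy, an assumption made purely for the example), builds the rewards through (11) and (12), and checks on a grid that the expected reward (9) peaks at $\hat{\eta} = \eta$. For this choice of $J$ the rewards reduce to the logarithmic scoring rule, $I_1(\hat{\eta}) = \log\hat{\eta}$ and $I_{-1}(\hat{\eta}) = \log(1 - \hat{\eta})$.

```python
import math

def J(eta):
    """Convex maximal-reward function: negative binary entropy (illustrative choice)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

def J_prime(eta):
    """Derivative of J."""
    return math.log(eta / (1.0 - eta))

def I1(eta_hat):
    """Reward when y = 1, eq. (11)."""
    return J(eta_hat) + (1.0 - eta_hat) * J_prime(eta_hat)

def I_neg1(eta_hat):
    """Reward when y = -1, eq. (12)."""
    return J(eta_hat) - eta_hat * J_prime(eta_hat)

def expected_reward(eta, eta_hat):
    """Expected reward of eq. (9)."""
    return eta * I1(eta_hat) + (1.0 - eta) * I_neg1(eta_hat)

eta = 0.3
grid = [k / 1000.0 for k in range(1, 1000)]   # open interval (0, 1)
best = max(grid, key=lambda e: expected_reward(eta, e))
print(best, expected_reward(eta, best), J(eta))
```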



The theorem states that $I_1(\cdot)$, $I_{-1}(\cdot)$ can be derived such that (10) holds by applying an
appropriate convex $J(\eta)$. This primary theorem was used in Masnadi-Shirazi and Vasconcelos (2009) to
establish the following for margin loss functions.

Theorem 2 Masnadi-Shirazi and Vasconcelos (2009) Let $J(\eta)$ be as defined in (10) and $f$ a
continuous function. If the following properties hold

1. $J(\eta) = J(1 - \eta)$,

2. $f$ is invertible with symmetry

$$f^{-1}(-v) = 1 - f^{-1}(v), \tag{13}$$

then the functions $I_1(\cdot)$ and $I_{-1}(\cdot)$ derived with (11) and (12) satisfy the following equalities

$$I_1(\eta) = -\phi(f(\eta)) \tag{14}$$
$$I_{-1}(\eta) = -\phi(-f(\eta)), \tag{15}$$

with

$$\phi(v) = -J[f^{-1}(v)] - (1 - f^{-1}(v)) J'[f^{-1}(v)]. \tag{16}$$

This theorem provides a generative path for designing Bayes consistent margin loss functions for
classification. Specifically, any convex symmetric function $J(\eta) = -C^*_\phi(\eta)$ and invertible function
$f^{-1}$ satisfying (13) can be used in equation (16) to derive a novel Bayes consistent loss function
$\phi(v)$. This is in contrast to previous approaches, which require guessing a loss function $\phi(v)$ and
checking that it is Bayes consistent by minimizing $C_\phi(\eta, f)$, so as to obtain whatever optimal
predictor $f^*_\phi$ and minimum expected risk $C^*_\phi(\eta)$ results Zhang (2004), or methods that restrict the loss
function to being convex, differentiable at zero, and have negative derivative at the origin Bartlett
et al. (2006).
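The generative recipe of Theorem 2 can be sketched numerically. As an illustrative assumption, we take the pair associated with the exponential loss, $J(\eta) = -2\sqrt{\eta(1 - \eta)}$ and $f(\eta) = \frac{1}{2}\log\frac{\eta}{1 - \eta}$, and evaluate (16) directly; the resulting $\phi(v)$ should coincide with $\exp(-v)$. The function names are hypothetical.

```python
import math

def J(eta):
    """Convex, symmetric J(eta) = -C*_phi(eta) for the exponential loss."""
    return -2.0 * math.sqrt(eta * (1.0 - eta))

def J_prime(eta):
    """Derivative of J."""
    return -(1.0 - 2.0 * eta) / math.sqrt(eta * (1.0 - eta))

def f_inv(v):
    """Inverse of the link f(eta) = 0.5*log(eta/(1-eta)); satisfies symmetry (13)."""
    return 1.0 / (1.0 + math.exp(-2.0 * v))

def phi(v):
    """Margin loss generated by eq. (16)."""
    eta = f_inv(v)
    return -J(eta) - (1.0 - eta) * J_prime(eta)

for v in (-1.0, 0.0, 0.5, 2.0):
    print(v, phi(v), math.exp(-v))
```

Swapping in a different symmetric convex $J$ or a different link satisfying (13) yields a different loss from the same four lines of `phi`.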



3. Cost sensitive Bayes consistent classifier design 

In this section we extend the connections between risk minimization and probability elicitation to 
the cost-sensitive setting. We start by reviewing the cost-sensitive zero-one loss. 



3.1 Cost-sensitive zero-one loss 

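The cost-sensitive extension of the zero-one loss is

$$L_{C_1,C_{-1}}(f, y) = \frac{1 - \mathrm{sign}(yf)}{2} \left( C_1 \frac{1 - \mathrm{sign}(f)}{2} + C_{-1} \frac{1 + \mathrm{sign}(f)}{2} \right) = \begin{cases} 0, & \text{if } y = \mathrm{sign}(f); \\ C_1, & \text{if } y = 1 \text{ and } \mathrm{sign}(f) = -1; \\ C_{-1}, & \text{if } y = -1 \text{ and } \mathrm{sign}(f) = 1, \end{cases} \tag{17}$$

where $C_1$ is the cost of a false negative and $C_{-1}$ that of a false positive. The associated conditional
risk is

$$C_{C_1,C_{-1}}(\eta, f) = C_1 \eta \frac{1 - \mathrm{sign}(f)}{2} + C_{-1} (1 - \eta) \frac{1 + \mathrm{sign}(f)}{2} = \begin{cases} C_{-1}(1 - \eta), & \text{if } f \ge 0; \\ C_1 \eta, & \text{if } f < 0, \end{cases} \tag{18}$$

and is minimized by any predictor that satisfies (5) with $\gamma = \frac{C_{-1}}{C_1 + C_{-1}}$. Examples of optimal
predictors include $f^*(\eta) = (C_1 + C_{-1})\eta - C_{-1}$ and $f^*(\eta) = \log\frac{\eta C_1}{(1 - \eta) C_{-1}}$. The associated optimal
classifier $h^* = \mathrm{sign}[f^*]$ implements the cost-sensitive Bayes decision rule, and the associated
minimum conditional (cost-sensitive) risk is

$$C^*_{C_1,C_{-1}}(\eta) = C_1 \eta \left( \frac{1}{2} - \frac{1}{2}\mathrm{sign}[f^*(\eta)] \right) + C_{-1} (1 - \eta) \left( \frac{1}{2} + \frac{1}{2}\mathrm{sign}[f^*(\eta)] \right) \tag{19}$$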

with $f^*(\eta) = (C_1 + C_{-1})\eta - C_{-1}$. We show that the minimum cost-sensitive zero-one risk is
equivalent to the minimum cost-sensitive Bayes error.

Theorem 3 The minimum risk associated with the cost-sensitive zero-one loss is equal to the
minimum cost-sensitive Bayes error,

$$R^*_{C_1,C_{-1}} = \epsilon_{C_1,C_{-1}}. \tag{20}$$

Proof

$$R^*_{C_1,C_{-1}} = E_X[C^*_{C_1,C_{-1}}(\eta)] = \int P(x)\, C^*_{C_1,C_{-1}}(P(1|x))\, dx \tag{21}$$
$$= \int_{P(1|x) \ge \gamma} \frac{P(x|1) + P(x|-1)}{2}\, C_{-1} \frac{P(x|-1)}{P(x|1) + P(x|-1)}\, dx + \int_{P(1|x) < \gamma} \frac{P(x|1) + P(x|-1)}{2}\, C_1 \frac{P(x|1)}{P(x|1) + P(x|-1)}\, dx \tag{22}$$
$$= \frac{1}{2} \int_{P(1|x) \ge \gamma} C_{-1} P(x|-1)\, dx + \frac{1}{2} \int_{P(1|x) < \gamma} C_1 P(x|1)\, dx \tag{23}$$
$$= \frac{1}{2} \left( C_{-1} \epsilon^{\gamma}_{-1} + C_1 \epsilon^{\gamma}_{1} \right) = \epsilon_{C_1,C_{-1}} \tag{24}$$

where $\epsilon^{\gamma}_{1}$ and $\epsilon^{\gamma}_{-1}$ are the miss rate and false positive rate associated with the cost-sensitive threshold $\gamma$
and $\epsilon_{C_1,C_{-1}}$ is the cost-sensitive Bayes error rate. We have also assumed, without loss of generality,
that the prior probabilities are equal. ■

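The next theorem highlights some fundamental properties of the minimum conditional
cost-sensitive zero-one risk.

Theorem 4 The risk of (19) has the following properties:

1. a maximum at $\eta^* = \frac{C_{-1}}{C_1 + C_{-1}}$,

2. symmetry defined by, $\forall \epsilon \in \left[0, \frac{1}{C_1 + C_{-1}}\right]$,

$$C^*_{C_1,C_{-1}}(\eta^* - C_{-1}\epsilon) = C^*_{C_1,C_{-1}}(\eta^* + C_1\epsilon). \tag{25}$$

Proof Note that (19) can be written as

$$C^*_{C_1,C_{-1}}(\eta) = \begin{cases} C_{-1}(1 - \eta), & \text{if } f^* \ge 0; \\ C_1 \eta, & \text{if } f^* < 0. \end{cases} \tag{26}$$

The two lines $C_{-1}(1 - \eta)$ and $C_1\eta$ intersect and form the maximum at $\eta^* = \frac{C_{-1}}{C_1 + C_{-1}}$.
When $\epsilon = 0$ we have the trivial case of $C^*_{C_1,C_{-1}}\left(\frac{C_{-1}}{C_1 + C_{-1}}\right) = C^*_{C_1,C_{-1}}\left(\frac{C_{-1}}{C_1 + C_{-1}}\right)$.

When $0 < \epsilon \le \frac{1}{C_1 + C_{-1}}$ we have $\eta = \frac{C_{-1}}{C_1 + C_{-1}} - C_{-1}\epsilon < \frac{C_{-1}}{C_1 + C_{-1}}$, in which case from (5),
$f^* < 0$ and

$$C^*_{C_1,C_{-1}}(\eta) = C_1 \eta = C_1 \left( \frac{C_{-1}}{C_1 + C_{-1}} - C_{-1}\epsilon \right) = \frac{C_1 C_{-1}}{C_1 + C_{-1}} - C_1 C_{-1}\epsilon. \tag{27}$$

When $0 < \epsilon \le \frac{1}{C_1 + C_{-1}}$ we also have $\eta = \frac{C_{-1}}{C_1 + C_{-1}} + C_1\epsilon > \frac{C_{-1}}{C_1 + C_{-1}}$, in which case from (5),
$f^* > 0$ and

$$C^*_{C_1,C_{-1}}(\eta) = C_{-1}(1 - \eta) = C_{-1} \left( 1 - \frac{C_{-1}}{C_1 + C_{-1}} - C_1\epsilon \right) = \frac{C_1 C_{-1}}{C_1 + C_{-1}} - C_1 C_{-1}\epsilon. \tag{28}$$

Thus proving that

$$C^*_{C_1,C_{-1}}(\eta^* - C_{-1}\epsilon) = C^*_{C_1,C_{-1}}(\eta^* + C_1\epsilon). \tag{29}$$

As noted by the following lemma, property 2. is in fact a generalization of property 1.

Lemma 1 Any concave function with the symmetry of (25) also has property 1. of Theorem 4.

Proof Taking the derivative of (25) with respect to $\epsilon$ at $\epsilon = 0$ leads to

$$C^{*\prime}\left(\frac{C_{-1}}{C_1 + C_{-1}}\right)(-C_{-1}) = C^{*\prime}\left(\frac{C_{-1}}{C_1 + C_{-1}}\right)(C_1), \tag{30}$$

which is satisfied only when $C^{*\prime}\left(\frac{C_{-1}}{C_1 + C_{-1}}\right) = 0$. Given that $C^*$ is a concave function, $C^*$ is
maximum at $\eta = \frac{C_{-1}}{C_1 + C_{-1}}$. ■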
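The two properties of Theorem 4 can be checked numerically. The following is a minimal sketch, assuming the example costs $C_1 = 4$, $C_{-1} = 2$ and hypothetical function names; it evaluates the piecewise form of the minimum conditional cost-sensitive zero-one risk, locates its maximum on a grid, and measures the largest violation of the symmetry (25).

```python
def min_cs_zero_one_risk(eta, c1, c_neg1):
    """Minimum conditional cost-sensitive zero-one risk (piecewise form of eq. (19))."""
    f_star = (c1 + c_neg1) * eta - c_neg1
    return c_neg1 * (1.0 - eta) if f_star >= 0 else c1 * eta

c1, c_neg1 = 4.0, 2.0
eta_star = c_neg1 / (c1 + c_neg1)   # cost-sensitive threshold gamma

# Property 1: the maximum over a fine grid of posteriors sits at eta_star.
grid = [k / 3000.0 for k in range(3001)]
eta_max = max(grid, key=lambda e: min_cs_zero_one_risk(e, c1, c_neg1))

# Property 2: symmetry of eq. (25) for eps in [0, 1/(c1 + c_neg1)].
eps_values = [k / 100.0 / (c1 + c_neg1) for k in range(101)]
max_gap = max(
    abs(min_cs_zero_one_risk(eta_star - c_neg1 * e, c1, c_neg1)
        - min_cs_zero_one_risk(eta_star + c1 * e, c1, c_neg1))
    for e in eps_values
)
print(eta_max, eta_star, max_gap)
```

With equal costs the same code recovers the cost-insensitive case, where the maximum sits at $\eta = \frac{1}{2}$.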



3.2 Cost-sensitive Bayes consistent margin losses 

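We extend the other losses used in machine learning to the cost-sensitive paradigm by introducing
the following set of margin loss functions

$$L_{\phi,C_1,C_{-1}}(f, y) = \phi_{C_1,C_{-1}}(yf) = \begin{cases} \phi_1(f), & \text{if } y = 1 \\ \phi_{-1}(-f), & \text{if } y = -1. \end{cases} \tag{31}$$

The associated conditional risk is

$$C_{\phi,C_1,C_{-1}}(\eta, f) = \eta \phi_1(f) + (1 - \eta) \phi_{-1}(-f) \tag{32}$$

and is minimized by the predictor

$$f^*_{\phi,C_1,C_{-1}}(\eta) = \arg\min_f C_{\phi,C_1,C_{-1}}(\eta, f). \tag{33}$$

This leads to the minimum conditional risk

$$C^*_{\phi,C_1,C_{-1}}(\eta) = \eta \phi_1(f^*_{\phi,C_1,C_{-1}}(\eta)) + (1 - \eta) \phi_{-1}(-f^*_{\phi,C_1,C_{-1}}(\eta)). \tag{34}$$

Similar to the cost-insensitive case, our choice of $\phi_i(\cdot)$ in (31) cannot be arbitrary and we require
certain properties for the loss function. These desirable properties are addressed by extending the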
approach of Masnadi-Shirazi and Vasconcelos (2009).

Theorem 5 Let $g(\eta)$ be any invertible function, $J(\eta)$ any convex function, and $\phi_i(\cdot)$ determined by
the following steps:

1. use (11) and (12) to obtain the $I_1(\eta)$ and $I_{-1}(\eta)$, and let $C_{\phi,C_1,C_{-1}}(\eta, f)$ be defined by (32).

2. set $\phi_1(g(\eta)) = -I_1(\eta)$ and $\phi_{-1}(-g(\eta)) = -I_{-1}(\eta)$.

Then $g(\eta) = f^*_{\phi,C_1,C_{-1}}(\eta)$ if and only if $J(\eta) = -C^*_{\phi,C_1,C_{-1}}(\eta)$.

Proof From 1. and Theorem 1, it follows that

$$\eta I_1(\hat{\eta}) + (1 - \eta) I_{-1}(\hat{\eta})$$

has maximum value $J(\eta)$, when $\hat{\eta} = \eta$. From 2. the same holds for

$$-\eta \phi_1(g(\hat{\eta})) - (1 - \eta) \phi_{-1}(-g(\hat{\eta}))$$

and

$$J(\eta) = -\eta \phi_1(g(\eta)) - (1 - \eta) \phi_{-1}(-g(\eta)).$$

It follows from (32)-(34) that $g(\eta) = f^*_{\phi,C_1,C_{-1}}(\eta)$ if and only if $J(\eta) = -C^*_{\phi,C_1,C_{-1}}(\eta)$.

The theorem provides a generative method for designing the loss functions $\phi_i(\cdot)$ starting from
any pair of invertible function $g(\eta)$ and convex function $J(\eta)$. The resulting loss function will
satisfy (32)-(34), when $g(\eta) = f^*_{\phi,C_1,C_{-1}}(\eta)$ and $J(\eta) = -C^*_{\phi,C_1,C_{-1}}(\eta)$.

What remains to be answered is how to choose $f^*_{\phi,C_1,C_{-1}}(\eta)$ and $C^*_{\phi,C_1,C_{-1}}(\eta)$ so as to
ensure cost-sensitive Bayes consistency. The following theorem provides a sufficient condition on
$f^*_{\phi,C_1,C_{-1}}(\eta)$ for the Bayes optimality of the loss function.

Theorem 6 Any invertible predictor $f(\eta)$ with symmetry

$$f^{-1}(-v) = \frac{2 C_{-1}}{C_1 + C_{-1}} - f^{-1}(v) \tag{35}$$

satisfies the necessary and sufficient conditions for cost-sensitive optimality of (5) with $\gamma = \frac{C_{-1}}{C_1 + C_{-1}}$.

Proof Assume that $f(\eta) = v$ is monotonically increasing. Note that $f^{-1}(0) = \frac{C_{-1}}{C_1 + C_{-1}}$, which
along with $\eta = f^{-1}(v)$ leads to $f\left(\frac{C_{-1}}{C_1 + C_{-1}}\right) = 0$. If $\eta > \frac{C_{-1}}{C_1 + C_{-1}}$ then from (35) we have
$f^{-1}(-v) < \frac{C_{-1}}{C_1 + C_{-1}}$; applying (35) again it follows that $f(\eta) > 0$. Similarly, if $\eta < \frac{C_{-1}}{C_1 + C_{-1}}$
then $f(\eta) < 0$. ■

In other words, any predictor $f^*_{\phi,C_1,C_{-1}}(\eta)$ that satisfies (35) will be guaranteed to have a
conditional risk that is minimized by the cost-sensitive Bayes decision rule.

What remains to be discussed is how to specify $C^*_{\phi,C_1,C_{-1}}(\eta)$, which will determine the risk
of the optimal classifier. The goal is to approximate the minimum conditional cost-sensitive
zero-one risk (minimum cost-sensitive Bayes risk) given in (19) as best as possible so as to achieve the
minimum cost-sensitive Bayes error. This is formally presented in the following theorem.

Theorem 7 The minimum risk of any cost-sensitive loss in the form of (31) and derived from
Theorem 5 can be made to be arbitrarily close, in the expectation, to the minimum cost-sensitive Bayes
error by choosing the minimum conditional risk of the loss to be arbitrarily close to the minimum
conditional risk of the cost-sensitive zero-one loss function.

Proof

$$R^*_{\phi,C_1,C_{-1}} - \epsilon_{C_1,C_{-1}} = R^*_{\phi,C_1,C_{-1}} - R^*_{C_1,C_{-1}} \tag{36}$$
$$= E_X\left[ C^*_{\phi,C_1,C_{-1}}(\eta) - C^*_{C_1,C_{-1}}(\eta) \right], \tag{37}$$

where we have used Theorem 3 for the first equality. ■



While Theorem 7 says that the true measure for determining $C^*_{\phi,C_1,C_{-1}}$ is the expectation of
(37), Theorem 4 suggests a simpler rule of thumb for selecting $C^*_{\phi,C_1,C_{-1}}$. Property 1. assigns the
largest risk to the locations on the classification boundary, so requiring this property for $C^*_{\phi,C_1,C_{-1}}$
is essential. Also, enforcing Property 2. further guarantees that the optimal risk has the symmetry
of the minimum cost-sensitive Bayes risk.

Definition 2 A minimum risk $C^*_{\phi,C_1,C_{-1}}(\eta)$ is of

1. Type-I if it satisfies property 1. but not 2. of Theorem 4.

2. Type-II if it satisfies both properties 1. and 2.

Risks of type-II are generally closer approximations to the cost-sensitive Bayes risk than those of
type-I, although, strictly speaking, the true measure is the expectation of (37).

The combination of Theorems 4-7 leads to a generic procedure for the design of cost-sensitive
classification algorithms, consisting of the following steps

1. select a predictor $f^*_{\phi,C_1,C_{-1}}(\eta)$ that satisfies (35).

2. select a concave minimum conditional risk using the measure of (37) or, as a simpler rule
of thumb alternative, select a concave minimum conditional risk $C^*_{\phi,C_1,C_{-1}}(\eta)$ of type-I or
type-II, which reduces to $C^*_\phi(\eta)$ when $C_1 = C_{-1} = 1$.

3. use (11) and (12) with $J(\eta) = -C^*_{\phi,C_1,C_{-1}}(\eta)$ to obtain $I_1(\eta)$ and $I_{-1}(\eta)$.

4. find $\phi_i(\cdot)$ so that $I_1(\eta) = -\phi_1(f^*_{\phi,C_1,C_{-1}}(\eta))$ and $I_{-1}(\eta) = -\phi_{-1}(-f^*_{\phi,C_1,C_{-1}}(\eta))$.

5. derive an algorithm to minimize the conditional risk of (32).

We next illustrate the practical application of this framework by showing that the cost-sensitive
exponential loss of Masnadi-Shirazi and Vasconcelos (2007) can be derived from a minimal
conditional risk of Type-I.

3.3 Cost-sensitive exponential loss

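We start by recalling that AdaBoost is based on the loss $\phi(yf) = \exp(-yf)$, for which it can be
shown that

$$C^*_\phi(\eta) = \eta \sqrt{\frac{1 - \eta}{\eta}} + (1 - \eta) \sqrt{\frac{\eta}{1 - \eta}} \quad \text{and} \quad f^*_\phi = \frac{1}{2} \log\frac{\eta}{1 - \eta}. \tag{38}$$

A natural cost-sensitive extension is $f^*_{\phi,C_1,C_{-1}}(\eta) = \frac{1}{C_1 + C_{-1}} \log\frac{\eta C_1}{(1 - \eta) C_{-1}}$, which is easily shown
to satisfy (35). Noting that $C^*_\phi(\eta) = \eta \exp(-f^*_\phi) + (1 - \eta) \exp(f^*_\phi)$ suggests the cost-sensitive
extension

$$C^*_{\phi,C_1,C_{-1}}(\eta) = \eta \left( \frac{\eta C_1}{(1 - \eta) C_{-1}} \right)^{\frac{-C_1}{C_1 + C_{-1}}} + (1 - \eta) \left( \frac{\eta C_1}{(1 - \eta) C_{-1}} \right)^{\frac{C_{-1}}{C_1 + C_{-1}}}. \tag{39}$$

This does not have the symmetry of (25) but satisfies property 1. of Theorem 4. Hence, it is a
Type-I risk. It is also equivalent to (38) when $C_1 = C_{-1} = 1$. Finally, steps 1. and 2. of Theorem 5
produce the loss

$$\phi_{C_1,C_{-1}}(yf) = \begin{cases} \exp(-C_1 f), & \text{if } y = 1 \\ \exp(C_{-1} f), & \text{if } y = -1 \end{cases} \tag{40}$$

proposed in Masnadi-Shirazi and Vasconcelos (2007). The resulting cost-sensitive boosting
algorithm currently holds the best performance in the literature.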
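The relation between the cost-sensitive exponential loss, its optimal predictor and its minimum conditional risk can be checked numerically. The sketch below is illustrative only: the costs $C_1 = 4$, $C_{-1} = 2$, the posterior $\eta = 0.6$ and all function names are assumptions. It plugs the predictor stated above into the conditional risk (32) of the loss (40) and compares the result with (39).

```python
import math

def cs_exp_conditional_risk(eta, f, c1, c_neg1):
    """Conditional risk (32) for the cost-sensitive exponential loss of eq. (40)."""
    return eta * math.exp(-c1 * f) + (1.0 - eta) * math.exp(c_neg1 * f)

def cs_exp_predictor(eta, c1, c_neg1):
    """Cost-sensitive optimal predictor stated in the text."""
    return math.log(eta * c1 / ((1.0 - eta) * c_neg1)) / (c1 + c_neg1)

def cs_exp_min_risk(eta, c1, c_neg1):
    """Minimum conditional risk of eq. (39)."""
    ratio = eta * c1 / ((1.0 - eta) * c_neg1)
    s = c1 + c_neg1
    return eta * ratio ** (-c1 / s) + (1.0 - eta) * ratio ** (c_neg1 / s)

c1, c_neg1, eta = 4.0, 2.0, 0.6
f_star = cs_exp_predictor(eta, c1, c_neg1)
risk_at_f_star = cs_exp_conditional_risk(eta, f_star, c1, c_neg1)
# Perturbing the predictor in either direction should not lower the risk.
risk_left = cs_exp_conditional_risk(eta, f_star - 1e-3, c1, c_neg1)
risk_right = cs_exp_conditional_risk(eta, f_star + 1e-3, c1, c_neg1)
print(f_star, risk_at_f_star, cs_exp_min_risk(eta, c1, c_neg1))
```

The predictor changes sign at $\eta = \frac{C_{-1}}{C_1 + C_{-1}}$, which is the cost-sensitive threshold $\gamma$ of (5).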

4. Cost sensitive SVM 

Next we extend the hinge loss used in SVMs using the cost sensitive framework established in the 
previous section. The cost sensitive SVM optimization problem is also derived. 

The SVM minimizes the risk of the hinge loss $\phi(yf) = \lfloor 1 - yf \rfloor_+$, where $\lfloor x \rfloor_+ = \max(x, 0)$.
The associated risk is minimized by Zhang (2004)

$$f^*_\phi(\eta) = \mathrm{sign}(2\eta - 1) \tag{41}$$

resulting in the minimum conditional risk

$$C^*_\phi(\eta) = 1 - |2\eta - 1| = \eta \lfloor 1 - \mathrm{sign}(2\eta - 1) \rfloor_+ + (1 - \eta) \lfloor 1 + \mathrm{sign}(2\eta - 1) \rfloor_+.$$

We follow the generic procedure and replace the optimal cost-insensitive predictor by its
cost-sensitive counterpart

$$f^*_{\phi,C_1,C_{-1}}(\eta) = \mathrm{sign}((C_1 + C_{-1})\eta - C_{-1}), \tag{42}$$

which can be directly shown to satisfy (5). This suggests choosing the cost-sensitive minimum
conditional risk

$$C^*_{\phi,C_1,C_{-1}}(\eta) = \eta \lfloor e - d \cdot \mathrm{sign}((C_1 + C_{-1})\eta - C_{-1}) \rfloor_+ + (1 - \eta) \lfloor b + a \cdot \mathrm{sign}((C_1 + C_{-1})\eta - C_{-1}) \rfloor_+, \tag{43}$$

which can be shown to satisfy (25) if and only if

$$d \ge e, \quad a \ge b \quad \text{and} \quad \frac{C_{-1}}{C_1} = \frac{a + b}{d + e}. \tag{44}$$

Figure 1: Left: concave $C^*_{\phi,C_1,C_{-1}}(\eta)$ function and corresponding cost-sensitive SVM loss function,
top: $C_1 = 4$, $C_{-1} = 2$, bottom: $C_1 = C_{-1} = 1$. Right: linearly separable cost-sensitive
SVM.

The hinge loss minimum conditional risk satisfies the conditions of a Type-II loss function and is
also a close approximation of the zero-one minimum conditional risk under the criteria of Theorem
7.

Applying steps 1. and 2. of Theorem 5 then yields the loss

$$\phi_{C_1,C_{-1}}(yf) = \begin{cases} \lfloor e - df \rfloor_+, & \text{if } y = 1 \\ \lfloor b + af \rfloor_+, & \text{if } y = -1. \end{cases} \tag{45}$$

This loss has four degrees of freedom, which control the margin and slope of the hinge components
associated with the two classes: positive examples are classified with margin $\frac{e}{d}$ and hinge loss slope
$d$, while for negative examples the margin is $\frac{b}{a}$ and slope $a$.

4.1 Cost-sensitive SVM learning 

We consider the case where errors in the positive class are weighted more heavily, leading to the
inequalities $\frac{b}{a} \le \frac{e}{d}$ and $d \ge a$. Choosing $e = d = C_1$ normalizes the margin of positive examples
to unity ($\frac{e}{d} = 1$). Selecting $b = 1$ then fixes the scale of the negative component of the hinge loss,
leading to $a = 2C_{-1} - 1$. The resulting cost-sensitive SVM loss function is

$$\phi_{C_1,C_{-1}}(yf) = 1_{\{y=1\}} C_1 \lfloor 1 - yf \rfloor_+ + 1_{\{y=-1\}} \lfloor 1 - (2C_{-1} - 1) yf \rfloor_+ \tag{46}$$

and the cost-sensitive SVM minimal conditional risk is

$$C^*_{\phi,C_1,C_{-1}}(\eta) = \eta \lfloor C_1 - C_1 \cdot \mathrm{sign}((C_1 + C_{-1})\eta - C_{-1}) \rfloor_+ + (1 - \eta) \lfloor 1 + (2C_{-1} - 1) \cdot \mathrm{sign}((C_1 + C_{-1})\eta - C_{-1}) \rfloor_+ \tag{47}$$

with $C_{-1} \ge 1$ and $C_1 \ge 2C_{-1} - 1$, so as to satisfy (44). Figure 1 presents plots of (47) and (46),
for both $C_1 = 4$, $C_{-1} = 2$ and the cost-insensitive case of $C_1 = 1$, $C_{-1} = 1$ (standard SVM).
Note that, for the cost-sensitive SVM, the positive class has a unit margin, while the negative class
has a smaller margin of $\frac{1}{3}$. Also, the slope of the positive component of the loss is 4 while the
negative component has a smaller slope of 3. In this way, the loss assigns a higher cost to errors
in the positive class when the data is not separable, while enforcing a larger margin for positive
examples when the data is separable. Replacing the standard hinge loss with (45) in the standard
SVM risk Moguerza and Munoz (2006)



$$\arg\min_{w,b} \sum_{\{i|y_i=1\}} \lfloor C_1 - C_1 (w^T x_i + b) \rfloor_+ + \sum_{\{i|y_i=-1\}} \lfloor 1 + (2C_{-1} - 1)(w^T x_i + b) \rfloor_+ + \mu \|w\|^2, \tag{48}$$

leads to the primal problem

$$\arg\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \left[ C_1 \sum_{\{i|y_i=1\}} \xi_i + \frac{1}{\kappa} \sum_{\{i|y_i=-1\}} \xi_i \right]$$
$$\text{subject to} \quad (w^T x_i + b) \ge 1 - \xi_i; \quad y_i = 1$$
$$(w^T x_i + b) \le -\kappa + \xi_i; \quad y_i = -1 \tag{49}$$

with

$$\kappa = \frac{1}{2C_{-1} - 1}, \quad 0 < \kappa \le 1 \le \frac{1}{\kappa} \le C_1. \tag{50}$$

This is a quadratic programming problem similar to that of the standard cost-insensitive SVM with
soft margin weight parameter $C$. In this case, cost-sensitivity is controlled by the parameters $C_1$, $\frac{1}{\kappa}$,
and $\kappa$. The parameter $\kappa$ is responsible for cost-sensitivity in the separable case. Under the
constraints $C_{-1} \ge 1$, $C_1 \ge 2C_{-1} - 1$ ($0 < \kappa \le 1 \le \frac{1}{\kappa} \le C_1$) of a type-II risk, it imposes a smaller
margin on negative examples. On the other hand, $C_1$ and $\frac{1}{\kappa}$ control the relative weights of
margin violations, assigning more weight to positive violations. This allows control of cost-sensitivity
when the data is not separable.
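The loss and the primal objective above can be written down in a few lines. The following is a minimal sketch rather than a training routine: it only evaluates the cost-sensitive hinge loss (46), the margin parameter $\kappa$ of (50), and the objective of (49) for a given $(w, b)$, with each slack variable set to its smallest feasible non-negative value. The function names and the example costs are assumptions.

```python
def cs_hinge_loss(y, f, c1, c_neg1):
    """Cost-sensitive hinge loss of eq. (46) for a label y in {-1, +1}."""
    if y == 1:
        return c1 * max(1.0 - f, 0.0)
    return max(1.0 + (2.0 * c_neg1 - 1.0) * f, 0.0)

def kappa(c_neg1):
    """Negative-class margin kappa = 1 / (2*C_{-1} - 1) from eq. (50)."""
    return 1.0 / (2.0 * c_neg1 - 1.0)

def cs_svm_primal_objective(w, b, xs, ys, c, c1, c_neg1):
    """Objective of eq. (49), with each slack at its smallest feasible value."""
    k = kappa(c_neg1)
    penalty = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y == 1:
            penalty += c1 * max(1.0 - score, 0.0)        # slack >= 1 - score
        else:
            penalty += (1.0 / k) * max(k + score, 0.0)   # slack >= kappa + score
    return 0.5 * sum(wi * wi for wi in w) + c * penalty

c1, c_neg1 = 4.0, 2.0
print(kappa(c_neg1))                       # negative-class margin
print(cs_hinge_loss(1, 0.0, c1, c_neg1))   # positive example on the boundary
print(cs_hinge_loss(-1, 0.0, c1, c_neg1))  # negative example on the boundary
```

Setting `c1 = c_neg1 = 1` gives $\kappa = 1$ and recovers the standard hinge loss and the standard soft-margin objective.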

Obviously, this primal problem could be defined through heuristic arguments. However, it 
would be difficult to justify precise choices for the parameters of (50). Furthermore, the
derivation above guarantees that the optimal classifier implements the Bayes decision rule of (5) with
$\gamma = \frac{C_{-1}}{C_1 + C_{-1}}$, and that its risk is a type-II approximation to the cost-sensitive Bayes risk. No such
guarantees would be possible for a heuristic solution.

To obtain some intuition about the cost-sensitive extension, we consider the synthetic problem 
of Figure 1, where the two classes are linearly separable. The figure shows three separating lines. 
The green line is an arbitrary separating line that does not maximize the margin. The red line is 
the standard SVM solution, which has maximum margin and is equally distant from the nearest 
examples of the two classes. The blue line is the solution of (49) for $C_1 = 4$ and $C_{-1} = 2$ (the
$C$ parameter is irrelevant when the data is separable). It is also a maximum margin solution, but
trades off the distance to positive and negative examples so as to enforce a larger positive margin,
as specified. Overall, an increase in $C_{-1}$ (decrease in $\kappa$) guarantees a larger positive margin. For a
given $C_{-1}$, increasing $C_1$ (so that $C_1 \ge 2C_{-1} - 1$) increases the cost of errors on positive examples,
enabling control of the miss rate when the classes are not separable.
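The separable behavior can be made concrete with the smallest possible example. As a sketch under strong simplifying assumptions (one positive and one negative point on a line, no slack, hypothetical function name), both constraints of (49) are active at the optimum, so $w$ and $b$ follow in closed form from $w x_+ + b = 1$ and $w x_- + b = -\kappa$.

```python
def cs_svm_two_point_1d(x_pos, x_neg, c_neg1):
    """Hard-margin solution of eq. (49) for one positive and one negative point on a line.

    Both constraints are active at the optimum:
    w * x_pos + b = 1 and w * x_neg + b = -kappa.
    """
    k = 1.0 / (2.0 * c_neg1 - 1.0)
    w = (1.0 + k) / (x_pos - x_neg)
    b = 1.0 - w * x_pos
    boundary = -b / w
    return w, b, boundary

# Standard SVM (C_{-1} = 1, kappa = 1): boundary midway between the points.
print(cs_svm_two_point_1d(4.0, 0.0, 1.0))
# Cost-sensitive case (C_{-1} = 2, kappa = 1/3): boundary moves toward the negative point.
print(cs_svm_two_point_1d(4.0, 0.0, 2.0))
```

In this toy setting the boundary splits the gap between the two points in the ratio $1 : \kappa$, so a larger $C_{-1}$ leaves more room on the positive side.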

We note that for the separable case, a limited level of cost sensitive performance can be achieved 
using the BP-SVM formulation of (1) along with a small weight parameter C (C < |), but a small 
C is undesirable in general as it leads to an undertrained model with training errors even when
the data is separable. The CS-SVM formulation, on the other hand, provides a maximum margin
solution regardless of the chosen weight parameter C. The CS-SVM is preferable even in the
inseparable case because increasing the weight parameter C, in an attempt to reduce training error,
inevitably leads to overtraining in the BP-SVM formulation. This is not necessarily the case for the
CS-SVM formulation, which allows a decrease of the margin of the negative samples (through an
appropriate choice of $\kappa$) and a relative increase in the margin of the positive samples, independent
of the weight parameter C, and does not lead to overtraining. In other words, unlike the BP-SVM
formulation, the CS-SVM does not simply overtrain on the positive class; it maximizes the margin
on this class. This can also be seen, with added clarity, in the dual CS-SVM formulation, which is
discussed in the next section.

5. Cost-sensitive SVM in the dual 

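The dual and kernelized formulation of the CS-SVM of (49) can be derived as

$$
\begin{aligned}
\arg\max_{\alpha}\quad & \sum_i \alpha_i \left[ \frac{y_i + 1}{2} - \kappa\,\frac{y_i - 1}{2} \right]
  - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{subject to}\quad & \sum_i \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq C C_1; \quad y_i = 1 \\
& 0 \leq \alpha_i \leq \frac{C}{\kappa}; \quad y_i = -1
\end{aligned}
\tag{51}
$$

which reduces to the standard SVM dual when $C_1 = C_{-1} = 1$. Unlike the previous BM-SVM and
BP-SVM algorithms, the CS-SVM algorithm behaves in a cost-sensitive manner regardless of the
separability of the data and the chosen slack penalty $C$. This can be further studied in detail by
writing the dual problem (51) as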



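$$
\begin{aligned}
\arg\max_{\alpha}\quad & \sum_i \alpha_i^+ + \kappa \sum_i \alpha_i^-
  - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{subject to}\quad & \sum_i \alpha_i y_i = 0 \\
& 0 \leq \alpha_i^+ \leq C C_1 \\
& 0 \leq \alpha_i^- \leq \frac{C}{\kappa}
\end{aligned}
\tag{52}
$$

with

$$
0 < \kappa \leq 1 \leq \frac{1}{\kappa} \leq C_1 \tag{53}
$$

$$
\alpha_i^+ = \{\alpha_i \mid y_i = 1\}, \qquad \alpha_i^- = \{\alpha_i \mid y_i = -1\}.
$$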



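Moreover, since $\alpha_i \geq 0$ and $\kappa = 1 - (1 - \kappa)$ we can rewrite (52) with an $\ell_1$ norm term as

$$
\begin{aligned}
\arg\max_{\alpha}\quad & -\frac{1}{2} \alpha^T Y K Y \alpha + \mathbf{1}^T \alpha - (1 - \kappa) \|\alpha^-\|_1 \\
\text{subject to}\quad & \alpha^T y = 0 \\
& 0 \preceq \alpha^+ \preceq C C_1 \\
& 0 \preceq \alpha^- \preceq \frac{C}{\kappa}
\end{aligned}
\tag{54}
$$

where $Y = \mathrm{Diag}(y)$ and $\mathbf{1}$ is the vector of all ones.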

When $\kappa = 1$ and $C_1 = 1$, the problem of (54) reverts to the standard SVM dual formulation.
This implies that (54) is fully compatible with standard dual solvers and is straightforward to
implement on top of existing SVM dual solvers.
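
The equivalence between the objectives of (52) and (54) for nonnegative multipliers can be checked numerically. The following is a minimal sketch in plain Python on a hypothetical three-example problem (the data, kernel values and function names are illustrative assumptions, not taken from the experiments):

```python
def quadratic_term(alpha, y, K):
    """alpha^T Y K Y alpha, with Y = Diag(y)."""
    n = len(alpha)
    return sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))


def objective_52(alpha, y, K, kappa):
    """Objective of (52): positives weighted by 1, negatives by kappa."""
    linear = sum(a if label == 1 else kappa * a for a, label in zip(alpha, y))
    return linear - 0.5 * quadratic_term(alpha, y, K)


def objective_54(alpha, y, K, kappa):
    """Objective of (54): standard dual minus (1 - kappa) * ||alpha^-||_1."""
    l1_negative = sum(abs(a) for a, label in zip(alpha, y) if label == -1)
    return (-0.5 * quadratic_term(alpha, y, K)
            + sum(alpha) - (1.0 - kappa) * l1_negative)


def upper_bounds(y, C, C1, kappa):
    """Per-example box constraints: C*C1 for positives, C/kappa for negatives."""
    return [C * C1 if label == 1 else C / kappa for label in y]


# Hypothetical toy problem: one positive and two negative examples.
y = [1, -1, -1]
alpha = [0.6, 0.2, 0.4]          # feasible: sum_i alpha_i * y_i = 0
K = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
gap = abs(objective_52(alpha, y, K, 0.5) - objective_54(alpha, y, K, 0.5))
```

The two objectives agree up to floating point error because $\alpha_i \geq 0$ makes $\sum_{y_i=-1} \alpha_i$ equal to $\|\alpha^-\|_1$.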

If we equivalently transform the problem of (54) into a minimization problem, $\|\alpha^-\|_1$ acts as an
$\ell_1$ regularization term with positive coefficient $(1 - \kappa)$. Another difference with the standard
cost-insensitive SVM (CI-SVM) dual problem is that in (54), the upper bounds on $\alpha^+$ and $\alpha^-$ are scaled
differently. Specifically, because $\frac{1}{\kappa} \leq C_1$, the active upper bound constraints on $\alpha^+$ are relaxed. In
summary, the CS-SVM dual problem (54) has two major differences compared to the CI-SVM dual
problem:

1. $\ell_1$ regularization on $\alpha^-$.

2. relaxed inequality constraints on $\alpha^+$.

These modifications have profound consequences which connect regularization theory and sensitivity
analysis to cost-sensitive learning. We study the implications of these modifications by first
representing the CI-SVM dual problem as a regularized risk minimization problem, which allows us
to justify the extra regularization term $-(1 - \kappa)\|\alpha^-\|_1$ for both cost sensitive learning
and imbalanced learning problems. Subsequently, we study the effect of relaxing the inequality
constraint on $\alpha^+$ using sensitivity analysis.

5.1 Regularization on Lagrange multipliers 

In this subsection we study the effects of $\ell_1$ regularization on $\alpha^-$ in the dual problem, while
considering imbalanced dataset learning and cost-sensitive learning separately.

5.1.1 Imbalanced dataset learning 

In many applications examples from the target (positive) class are outnumbered by the non-target
class. Moreover, in multi-class classification problems where the number of classes is large and
a one-versus-all scheme is used, the number of examples in each individual class is usually small
compared to the rest of the examples, leading to a highly imbalanced problem. These sorts of
imbalances occur with different intensity, with ratios between the minority and majority class ranging
from 1:10 to 1:$10^6$ Provost and Fawcett (2001).

Figure 2: (a) The Checkerboard dataset with imbalance ratio 1:1000, (b) classification result of
BM-SVM, (c) classification result of BP-SVM with $C_1 = 100$ and $C_{-1} = 1$ (CS-SVM with
$\kappa = 1$), (d) classification result of CS-SVM with $\kappa = 0.5$, (e) classification result of
CS-SVM with $\kappa = 0.25$, (f) classification result of CS-SVM with $\kappa = 0.1$, (g) classification
result of CS-SVM with $\kappa = 0.01$, (h) classification result of CS-SVM with $\kappa = 0.001$.

For the SVM training problem, the number of support vectors grows linearly with the number of 
examples Steinwart (2004), and roughly speaking, it could be assumed that the number of support 
vectors for each class grows linearly with the number of examples of each class. Therefore, the 
same imbalance, if not worse, transpires in the solution. In other words, when the dual problem 
is solved, most of the support vectors belong to the majority class. The problem becomes more 
obvious when we take into account the equality constraint of (54) 

$$
\sum_i \alpha_i y_i = 0, \tag{55}
$$

which implies

$$
\|\alpha^+\|_1 = \|\alpha^-\|_1 \tag{56}
$$

and so for imbalanced datasets

$$
\mathrm{Card}(\alpha^+) \ll \mathrm{Card}(\alpha^-). \tag{57}
$$

This results in an irregular solution, with the $\alpha_i^+$s taking values close to the upper bound $C$ and the
$\alpha_i^-$s taking values close to the lower bound zero. Wu and Chang (2005) illustrated this problem by
conducting an experiment on a 2D Checkerboard dataset with different imbalance ratios as seen in
Figure 2(a). They showed that in the case of imbalanced data, the decision boundary is undesirably
shifted toward the minority class. This is because too few examples (support vectors)
of the minority class reside close to the correct decision boundary. When enough examples
do not exist at the right place, the margin relies on other examples farther away from the ideal
decision boundary, resulting in the decision boundary shifting toward the minority class. They also
equivalently illustrated that this is caused by irregular values in the dual variables. This problem
persists in the BM-SVM and BP-SVM formulations as a result of their flawed implementation of the
asymmetric margin, and can be seen in Figure 2(b) and (c), which show the classification results for
the BM-SVM and BP-SVM on the Checkerboard dataset.

Figure 3: The CS-SVM algorithm for different choices of $\kappa$ is applied to the covtype UCI dataset,
which is imbalanced with a ratio of 1:211. Starting at $\kappa = 1$, CS-SVM acts as the BP-SVM.
(a) shows the reduction in the number of $\alpha^-$ as $\kappa$ decreases and (b) shows the
reduction in the imbalance ratio as $\kappa$ decreases. As $\kappa$ decreases the number of negative
support vectors is reduced so that by $\kappa = 2^{-256}$ the imbalance ratio between support
vectors approaches 1.
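
To make the effect of the equality constraint concrete, the sketch below uses hypothetical multipliers (2 positive and 200 negative support vectors, values chosen only so that $\sum_i \alpha_i y_i = 0$) and compares the per-class $\ell_1$ norms, cardinalities and average magnitudes. The numbers are illustrative assumptions, not results from the paper.

```python
def class_summary(alpha, y, tol=1e-12):
    """Return {label: (l1_norm, number_of_nonzero_multipliers)} for labels +1 and -1."""
    summary = {}
    for label in (1, -1):
        values = [a for a, lab in zip(alpha, y) if lab == label]
        l1 = sum(abs(a) for a in values)
        card = sum(1 for a in values if abs(a) > tol)
        summary[label] = (l1, card)
    return summary


# Hypothetical multipliers: 2 positive and 200 negative support vectors.
y = [1] * 2 + [-1] * 200
alpha = [5.0] * 2 + [0.05] * 200   # chosen so that sum_i alpha_i * y_i = 0
stats = class_summary(alpha, y)
mean_pos = stats[1][0] / stats[1][1]     # average positive multiplier
mean_neg = stats[-1][0] / stats[-1][1]   # average negative multiplier
```

With equal $\ell_1$ norms and a 1:100 cardinality ratio, the average positive multiplier is about 100 times the average negative one, which is the irregularity described above.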

Given that for imbalanced dataset problems the vector $\alpha^-$ has small non-sparse elements while
the vector $\alpha^+$ is highly sparse (57), the natural remedy is to regularize the non-sparse part of the
solution, $\alpha^-$, with a sparsity inducing $\ell_1$ regularizer Boyd and Vandenberghe (2004). This leads to
a sparse $\alpha^-$ at the solution, which is now both balanced and regularized. The proposed CS-SVM
formulation uses this method and can deal with the problem of imbalanced datasets through an
appropriate choice of $\kappa$ in (54). As $\kappa$ tends to zero the regularization coefficient $(1 - \kappa)$ increases,
resulting in an increased regularization of the $\alpha_i^-$s, which in turn results in an increased positive
margin. Figure 2(g) shows that for small $\kappa = 0.01$ the decision boundary is close to the ideal.
Choosing $\kappa < 0.01$ violates the condition of (53) and leads to preferring the majority class as seen in
Figure 2(h).

[Figure 4 diagram. Nodes: Primal Max. Margin Sep. $(w, X)$; Dual of Max. Margin Sep. $(\alpha, K)$;
Reg. Risk Min. $(\beta, K)$; Reg. Risk Min. $(z, K^{-1})$. Edge labels: Dual, $XX^T \to K$, Dual.]

Figure 4: The commutative diagram for existing SVM formulations essentially depends on the
associated parameter spaces $w$, $\alpha$, $\beta$, $z$ and feature spaces $X$, $K$, $K^{-1}$. The matrix $X$ is the
Cholesky factor of $K$, with its $i$-th row corresponding to the feature space representation
of the example $x_i$, i.e. $X_i = \varphi(x_i)$.

As an added example, Figure 3 illustrates the effect of the CS-SVM regularization on the number
of support vectors of each class in the solution. The CS-SVM algorithm with different choices of $\kappa$
is applied to the covtype UCI dataset, which is imbalanced with a ratio of 1:211. The $\alpha^-$ become
sparse as the regularizer coefficient $(1 - \kappa)$ increases, as seen in Figure 3(a). This leads to an
equivalence between the number of non-zero components of $\alpha^-$ and $\alpha^+$, as seen in Figure 3(b).

In summary, the CS-SVM in the dual performs a sparsity inducing $\ell_1$ regularization on the $\alpha^-$.
When dealing with imbalanced datasets, the CS-SVM implicitly prevents unwanted movement of
the discriminant boundary toward the minority class, which is equivalent to learning an asymmetric
margin in the primal in favor of the minority class.

5.1.2 Cost-sensitive learning 

According to the previous discussion, regularization on any class results in a smaller margin for that 
class. So, in cost-sensitive learning, CS-SVM reduces the margin for the class with the lower cost, 
or equivalently increases the margin for the class with the higher cost. 

In general, the extra $\ell_1$ regularization in the CS-SVM dual problem makes the margin asymmetric,
in favor of the minority class for imbalanced data learning and of the class with higher cost for
cost-sensitive learning.



5.2 Regularization on basis expansion coefficients 

In the previous section we showed how the Lagrange dual Boyd and Vandenberghe (2004) of the
CS-SVM performs $\ell_1$ regularization on the support vectors. In this section we show instead that the
Fenchel dual Rockafellar (1970) of the CS-SVM performs $\ell_1$ regularization on the basis coefficients
of the discriminant function. A general regularization problem Tikhonov and Arsenin (1977) can
be written as

$$
\arg\min_{f}\; \lambda\,\Omega(f) + \sum_i L(y_i, f(x_i)) \tag{58}
$$

where $L$ is the loss function, $\lambda$ is the regularization (trade-off) parameter, and $\Omega$ is the regularizer
function. The Fenchel dual of (58) is Rifkin and Lippert (2007)

$$
\arg\max_{z}\; -\lambda\,\Omega^*\!\left(\frac{z}{\lambda}\right) - \sum_i L^*(y_i, -z_i) \tag{59}
$$

where $z$ is the dual variable and $*$ denotes the Fenchel conjugate of the associated function.
For functions within an infinite dimensional Hilbert space $\mathcal{H}$ and in the form of

$$
f_\beta(x) = \sum_i \beta_i K(x, x_i) + b, \tag{60}
$$

the so-called unregularized bias form Poggio et al. (2001); Rifkin and Lippert (2007) can be written

$$
\arg\min_{f}\; \frac{\lambda}{2} \|f - b\|_{\mathcal{H}}^2 + \sum_i L(y_i, f(x_i)). \tag{61}
$$

After substituting $f_\beta$ in (61), the resulting problem becomes Chapelle (2007, Appendix B):

$$
\arg\min_{\beta, b}\; \frac{\lambda}{2} \beta^T K \beta + \sum_i L(y_i(\beta^T K_i + b)) \tag{62}
$$

where $K_i$ is the $i$-th column of the kernel matrix. The Fenchel dual of the regularized risk
minimization of (61) can be written as¹

$$
\begin{aligned}
\arg\max_{z}\quad & -\frac{1}{2\lambda} \|f_z^* - b\|_{\mathcal{H}^*}^2 - \sum_i L^*(y_i, -z_i) \\
\text{subject to}\quad & \mathbf{1}^T z = 0,
\end{aligned}
\tag{63}
$$

where from the representer theorem Wahba (1990); Scholkopf and Smola (2001)

$$
f_z^*(x) = \sum_i z_i K^{-1}(x, x_i) + b. \tag{64}
$$

The dual norm $\|\cdot\|_{\mathcal{H}^*}$ is associated with the Hilbert space $\mathcal{H}^*$ equipped with the kernel matrix
$K^{-1}$. It has been shown (see Kloft et al. (2011), Section 3.5) that for the conjugate of non-isotropic
norms (Mahalanobis distances) in Euclidean space

$$
\left( \tfrac{1}{2} \|\beta\|_{K}^2 \right)^* = \tfrac{1}{2} \|z\|_{K^{-1}}^2 \quad \text{for } K \succ 0. \tag{65}
$$

Since $K^{-1}$ is a symmetric positive definite matrix, it is therefore a Gram matrix and thus a kernel
matrix. This shows that the primal problem and the dual problem are similar in that they
are both regularized risk minimization problems, but in different spaces with different norms.
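The Fenchel-Legendre conjugate of the CS-Hinge loss $\phi_{cs}$ of (46) is

$$
\phi_{cs}^*(y, z) =
\begin{cases}
yz \quad \text{s.t. } 0 \leq yz \leq C_1, & \text{for } y = +1, \\
\kappa yz \quad \text{s.t. } 0 \leq yz \leq \frac{1}{\kappa}, & \text{for } y = -1.
\end{cases}
\tag{66}
$$

1. The unregularized bias formulation results in the constraint $\mathbf{1}^T z = 0$ in the dual; for more
discussion see Rifkin and Lippert (2007), Section 9.1.

The dual regularized risk minimization problem of (63) can be written for the CS-Hinge loss $\phi_{cs}$
as

$$
\begin{aligned}
\arg\max_{z}\quad & -\frac{1}{2\lambda} z^T K^{-1} z + \sum_i y_i z_i^+ + \sum_i \kappa y_i z_i^- \\
\text{subject to}\quad & \mathbf{1}^T z = 0
\end{aligned}
\tag{67}
$$

with

$$
z_i^+ = \{z_i \mid y_i = 1\}, \qquad z_i^- = \{z_i \mid y_i = -1\},
$$

which in turn can be written as

$$
\begin{aligned}
\arg\max_{z}\quad & -\frac{1}{2\lambda} z^T K^{-1} z + y^T z - (1 - \kappa) \|z^-\|_1 \\
\text{subject to}\quad & \mathbf{1}^T z = 0
\end{aligned}
\tag{68}
$$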
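
The step from (67) to (68) can be spelled out as follows. This is a short sketch that assumes the sign condition $y_i z_i \geq 0$ implied by the constraints in (66), so that $y_i z_i = |z_i|$ for every example:

```latex
\begin{aligned}
\sum_{i:\, y_i = 1} y_i z_i + \kappa \sum_{i:\, y_i = -1} y_i z_i
  &= \sum_i y_i z_i - (1 - \kappa) \sum_{i:\, y_i = -1} y_i z_i \\
  &= y^T z - (1 - \kappa) \sum_{i:\, y_i = -1} |z_i| \\
  &= y^T z - (1 - \kappa)\, \| z^- \|_1 .
\end{aligned}
```

The same rearrangement, applied to the multipliers $\alpha$, takes (52) to (54).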

The CS-SVM has been formulated as a regularized risk minimization problem where the task
is to find basis coefficients $z$ of the discriminant function $f^*$ in the Hilbert space associated with
the kernel $K^{-1}$. Figure 4 depicts the commutative diagram for the existing SVM formulations,
which shows the connections between the dual problem of regularized risk minimization
and other formulations.

$\mathrm{Card}(z^+)$ and $\mathrm{Card}(z^-)$ quantitatively reflect the contribution of each class to the
discriminant function (64). Thus the problem of (68) performs an extra $\ell_1$ regularization on the basis
expansion coefficients $z^-$ when compared to the CI-SVM problem. We study this regularization
term for both the imbalanced data learning and cost-sensitive learning settings.

5.2.1 Imbalanced dataset learning 

For the imbalanced dataset learning problem, we have $\mathrm{Card}(z^+) \ll \mathrm{Card}(z^-)$. This means that the
discriminant function (64) is mostly made up of data-dependent kernel bases that correspond to the
majority class. In terms of learning the discriminant function, CI-SVM algorithms overtrain on the
majority class while undertraining on the minority class. Similar to basis pursuit Chen et al. (1999),
$\ell_1$ regularization on the basis expansion coefficients can alleviate the problem of overtraining on
the majority class. Thus, regularizing $z^-$ with the $\ell_1$ norm in (68) leads to a discriminant function
where both classes contribute equivalently.

5.2.2 Cost-sensitive learning 

In the cost-sensitive learning problem, one class is more important to the learning problem.
Therefore, the class with the higher cost should be trained more accurately and contribute more than
the other class. Equivalently, the class with lower cost should be trained less than the other class.
This idea can be implemented by performing $\ell_1$ regularization on the expansion coefficients of the
lower-cost class ($z^-$) as seen in the CS-SVM problem of (68).

5.3 Sensitivity analysis 

In the previous section we studied the implications of the additional regularization term in the
objective function of the dual of the CS-SVM problem. In this section we examine the modifications
in the constraints of the dual of the CS-SVM problem. In optimization theory, when a problem
is convex and Slater's condition is satisfied, i.e. strong duality holds, the optimal values of the dual
variables contain local sensitivity information for the problem with perturbed constraints Boyd and
Vandenberghe (2004). Such analysis can be used to improve the value of the objective function at
the solution by altering the bounds of the constraints. Here we show that the dual CS-SVM problem
of (54) can be thought of as the original dual CI-SVM with perturbed constraints. We proceed to
show that within the cost sensitive and imbalanced data setting, altering the constraints in the form
of the CS-SVM leads to an improved SVM objective function.

The dual CI-SVM can be written as a perturbed optimization problem in the form of

$$
\begin{aligned}
\arg\max_{\alpha}\quad & -\frac{1}{2} \alpha^T Y K Y \alpha + \mathbf{1}^T \alpha \\
\text{subject to}\quad & \alpha^T y = 0 \\
& \alpha - C \preceq u \\
& -\alpha \preceq 0
\end{aligned}
\tag{69}
$$

where we have perturbed the original constraints by $u$; setting $u = 0$ retrieves the original
CI-SVM dual problem. We define $d^*(u)$ as the optimal value of the objective for the above perturbed
problem. Consequently, the original unperturbed CI-SVM optimal objective is $d^*(0)$.

Given that strong duality holds, $d^*(u)$ can be written in terms of the Lagrangian of (69) as

$$
\begin{aligned}
d^*(u) &= \mathcal{L}(\alpha^*, \xi^*, \nu^*, \tau^*, u) \\
&= -\frac{1}{2} \alpha^{*T} Y K Y \alpha^* + \mathbf{1}^T \alpha^* + \xi^{*T} (\alpha^* - C - u) - \nu^{*T} \alpha^* + \tau^* y^T \alpha^*
\end{aligned}
\tag{70}
$$

Taking the derivative of the perturbed optimal objective $d^*(u)$ with respect to $u$ and evaluating this
at $u = 0$ provides local sensitivity information for the perturbed optimization problem Boyd and
Vandenberghe (2004). For (70) this can be written as

$$
\frac{\partial d^*(u)}{\partial u_i} = \frac{\partial \mathcal{L}(\alpha^*, \xi^*, \nu^*, \tau^*, u)}{\partial u_i} = -\xi_i^* \tag{71}
$$

which indicates that the direction of steepest ascent for $d^*(u)$ at $u = 0$ is $-\xi^*$, the value of the
slack variables in the primal solution. In other words, the value of the optimal objective $d^*(u)$ can
be best improved by relaxing the constraints (choosing larger $u_i > 0$) that correspond to larger
slack variables. This happens to be exactly the case for the CS-SVM. Specifically, the CS-SVM
constraints of (54) are equivalent to the perturbed CI-SVM constraints of (69) with
$u^+ = \{u_i = (C_1 - 1)C \mid y_i = 1\}$ and $u^- = \{u_i = 2(C_{-1} - 1)C \mid y_i = -1\}$. Given that for imbalanced
and cost sensitive problems the nonzero slack variables $\xi^+ = \{\xi_i \mid y_i = 1\}$ are generally larger
than the nonzero slack variables $\xi^- = \{\xi_i \mid y_i = -1\}$, equation (71) indicates that the best choice
for improving the optimal objective function $d^*(u)$ is to relax the constraints that correspond to
positive points $y_i = 1$ more than the constraints that correspond to negative points $y_i = -1$, or in
other words to choose $u^+ \geq u^- \geq 0$, which is equivalent to the actual CS-SVM requirement of
choosing $C_1 \geq 2C_{-1} - 1$ as seen in section 4.1.
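
The correspondence between the perturbations and the cost condition can be verified with a few lines of arithmetic. The sketch below uses illustrative function names and parameter values; only the two formulas for $u^+$ and $u^-$ are taken from the text.

```python
def cs_svm_perturbations(C, C1, C_neg):
    """Return (u_plus, u_minus): the relaxations of the CI-SVM upper bound C
    that give the CS-SVM bounds, u+ = (C_1 - 1)*C and u- = 2*(C_{-1} - 1)*C."""
    u_plus = (C1 - 1.0) * C
    u_minus = 2.0 * (C_neg - 1.0) * C
    return u_plus, u_minus


def relaxes_positive_more(C, C1, C_neg):
    """True when u+ >= u- >= 0, i.e. positive constraints are relaxed at least as much."""
    u_plus, u_minus = cs_svm_perturbations(C, C1, C_neg)
    return u_plus >= u_minus >= 0.0


# Illustrative values: C = 1, C_1 = 4, C_{-1} = 2.
u_plus, u_minus = cs_svm_perturbations(C=1.0, C1=4.0, C_neg=2.0)
```

For these values the relaxed bounds are $C + u^+ = 4 = C C_1$ and $C + u^- = 3 = C(2C_{-1} - 1)$, and $u^+ \geq u^-$ holds exactly when $C_1 \geq 2C_{-1} - 1$.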

6. Example-dependent cost-sensitive learning

In many applications such as computational advertising Agarwal (2011), medical diagnosis Turney 
(2000), information retrieval Martin Szummer (2011), fraud detection Fawcett and Provost (1997); 
Stolfo et al. (2000) and business decision-making Zadrozny et al. (2003) the cost of misclassifying 
an individual example differs from other examples including those of the same class. This gives 
rise to the concept of example-dependent cost-sensitive (ED-CS) learning. There are two basic ap- 
proaches to the problem of ED-CS learning depending on whether the test examples are available 
before the training process or not. When the test examples are available, incorporating this informa- 
tion into the decision making process through transductive learning Vapnik (1982, 1995), in other 
words labeling a specific test set, is applicable and can lead to improved results Joachims (1999); 
Collobert et al. (2006). This learning paradigm has been applied to ED-CS learning problems, and 
the presence of the cost of test examples or their estimation when test costs are unknown, plays an 
important role in example-dependent decision making. In this regard, Zadrozny and Elkan (2001) 
suggest the direct cost-sensitive method to first estimate test costs by regression and then predict the 
test labels so that overall cost is minimized. Although transductive inference can produce superior
performance, it is inherently a non-convex combinatorial optimization problem. Moreover,
transductive inference is not always applicable and in general the test examples are not available before
training, so inductive learning, i.e. learning a general decision rule for all possible test sets, is used.
Here we present the example dependent cost-sensitive hinge loss (ED-CS-Hinge) for the general
case of inductive example dependent SVM learning, which does not make any assumption about
the presence of test examples prior to training.

6.1 Inductive example dependent cost-sensitive inference 

Many previously proposed ED-CS learning algorithms incorporate example-costs into the learning 
process through various sampling schemes. Brefeld et al. (2003) proposed a method where the 
training examples are resampled according to the example cost probability distribution of the data. 
Zadrozny et al. (2003) presented a general framework for converting any cost-insensitive algorithm 
into an example dependent cost-sensitive algorithm based on resampling the dataset according to 
the example costs. Similar to Boosting, their algorithm aggregates an ensemble of models from 
different samples of the dataset. Despite their intuitive simplicity, resampling methods may suffer
from overfitting caused by duplicate examples.

More recently, Scott (2011) proposed, but did not implement, an example-based version of the
BP-SVM loss function which we call ED-BP-Hinge. The ED-BP-Hinge loss is defined for each
example with label $y$, decision value $f$ and cost $c$ as

$$
\phi(y, f, c) = c \lfloor 1 - yf \rfloor_+ \tag{72}
$$

It can be shown that the ED-BP-Hinge loss is Bayes-consistent by providing regret bounds Scott
(2011).

In dealing with the example dependent cost sensitive learning problem we extend the CS-SVM
loss of (45) to the ED-CS-Hinge defined as

$$
\phi(y, f, c) =
\begin{cases}
\ldots & \text{for } y = +1, \\
\ldots & \text{for } y = -1.
\end{cases}
\tag{73}
$$

The ED-CS-Hinge loss function inherits the benefits of the CS-SVM loss, including the added
flexibility of choosing an asymmetric margin of the loss when compared to the ED-BP-Hinge. In the
experimental section we implement an example dependent cost sensitive SVM based on the
ED-CS-Hinge loss and show significant improvement over the ED-BP-Hinge based SVM and other SVM
based algorithms on the KDD98 dataset.
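
For reference, the example-weighted hinge loss $c \lfloor 1 - yf \rfloor_+$ of the ED-BP-Hinge can be evaluated as in the minimal sketch below. The function names and the sample values are illustrative assumptions; the ED-CS-Hinge itself is not sketched here.

```python
def ed_bp_hinge(y, f, c):
    """Example-dependent BP hinge loss: c * max(0, 1 - y*f)."""
    return c * max(0.0, 1.0 - y * f)


def total_ed_bp_hinge(labels, scores, costs):
    """Sum of the per-example losses over a dataset."""
    return sum(ed_bp_hinge(y, f, c) for y, f, c in zip(labels, scores, costs))


# Hypothetical labels, decision values and per-example costs.
labels = [1, 1, -1]
scores = [2.0, 0.5, 0.5]
costs = [5.0, 4.0, 2.0]
total = total_ed_bp_hinge(labels, scores, costs)
```

The first example lies outside the margin and contributes nothing; the other two contribute their margin violation scaled by their individual cost.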

7. Performance measure 

The evaluation of cost sensitive algorithms requires a flexible performance measure that can
incorporate different costs and priors. We adopt the cost sensitive zero-one risk, which can be written
as

$$
\begin{aligned}
R_{cs} &= E_X\left[ E_{Y|X}\left[ L_{C_1, C_{-1}}(f(x), y) \mid X = x \right] \right] \\
&= \sum_y \sum_x P_{X|Y}(X = x \mid Y = y)\, P_Y(y)\, L_{C_1, C_{-1}}(f(x), y) \\
&= P_Y(+1) \sum_x P_{X|Y}(X = x \mid Y = +1)\, L_{C_1, C_{-1}}(f(x), +1) \\
&\quad + P_Y(-1) \sum_x P_{X|Y}(X = x \mid Y = -1)\, L_{C_1, C_{-1}}(f(x), -1) \\
&= P_1 C_1 P_{FN} + P_{-1} C_{-1} P_{FP}
\end{aligned}
\tag{74}
$$

where $P_1$ and $P_{-1}$ are the class priors and $P_{FN}$ and $P_{FP}$ are the false negative and false positive
rates respectively. For $C_1 = C_{-1} = 1$ this performance measure simplifies to the well known probability of
error measure $R_{01} = P_1 P_{FN} + P_{-1} P_{FP}$, which we call cost insensitive risk.
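
The final form of the risk, $P_1 C_1 P_{FN} + P_{-1} C_{-1} P_{FP}$, can be computed directly from a confusion matrix. The sketch below is illustrative: the counts and costs are made-up values, and the helper names are not from the paper.

```python
def rates_from_counts(tp, fn, fp, tn):
    """Return (P_FN, P_FP): false negative and false positive rates."""
    return fn / (tp + fn), fp / (fp + tn)


def cost_sensitive_risk(p_pos, c_pos, p_fn, c_neg, p_fp):
    """R_cs = P_1*C_1*P_FN + P_{-1}*C_{-1}*P_FP, with P_{-1} = 1 - P_1."""
    p_neg = 1.0 - p_pos
    return p_pos * c_pos * p_fn + p_neg * c_neg * p_fp


# Hypothetical confusion counts: 10 positives and 90 negatives.
tp, fn, fp, tn = 8, 2, 10, 80
p_fn, p_fp = rates_from_counts(tp, fn, fp, tn)
p_pos = (tp + fn) / (tp + fn + fp + tn)
risk_cs = cost_sensitive_risk(p_pos, 5.0, p_fn, 1.0, p_fp)   # C_1 = 5, C_{-1} = 1
risk_01 = cost_sensitive_risk(p_pos, 1.0, p_fn, 1.0, p_fp)   # unit costs
```

With unit costs the risk equals the plain error rate, here 12 errors out of 100 examples.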

The cost sensitive zero-one risk of (74) can be further justified from the vector optimization
perspective. Each classifier produces a set of vectors $(P_{FP}, P_{FN})$ in the cone $\mathbb{R}^2_+$, which induces
component wise inequality in $\mathbb{R}^2$. The minimal elements of this set comprise the Pareto optimal
frontier Boyd and Vandenberghe (2004), which is also known as the ROC curve in detection theory.
Different points on the ROC of a classifier can be found by the vector scalarization optimization
problem of

$$
\min_{P_{FP}, P_{FN}} \lambda_1 P_{FN} + \lambda_2 P_{FP}. \tag{75}
$$

Choosing $(\lambda_1, \lambda_2) = (P_1 C_1, P_{-1} C_{-1})$ results in the following optimization problem

$$
\min_{P_{FP}, P_{FN}} P_1 C_1 P_{FN} + P_{-1} C_{-1} P_{FP} \tag{77}
$$

which has an objective function equal to the cost sensitive zero-one risk of (74). This means
that by using the cost sensitive zero-one risk as the performance measure and choosing a certain
$(P_1 C_1, P_{-1} C_{-1})$ we are in fact finding a certain optimal point on the classifier ROC curve that
corresponds to $(\lambda_1, \lambda_2) = (P_1 C_1, P_{-1} C_{-1})$. We use the term minimum risk instead of minimum
cost-sensitive zero-one risk in the rest of the paper.
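
The scalarization can be illustrated on a finite set of ROC operating points. In the sketch below the weights are named by the error type they multiply, matching the final form of (74): the weight $P_1 C_1$ is applied to $P_{FN}$ and $P_{-1} C_{-1}$ to $P_{FP}$. The list of operating points is hypothetical.

```python
def min_risk_point(roc_points, weight_fn, weight_fp):
    """Pick the (P_FP, P_FN) pair that minimizes weight_fn*P_FN + weight_fp*P_FP."""
    return min(roc_points, key=lambda p: weight_fn * p[1] + weight_fp * p[0])


# Hypothetical operating points of one classifier, as (P_FP, P_FN) pairs.
points = [(0.0, 1.0), (0.1, 0.4), (0.3, 0.1), (1.0, 0.0)]

balanced = min_risk_point(points, weight_fn=0.5, weight_fp=0.5)
fp_heavy = min_risk_point(points, weight_fn=0.1, weight_fp=0.9)
```

Changing the weight pair moves the selected point along the frontier, which is how a choice of costs and priors selects one point of the ROC curve.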

When the $(P_1 C_1, P_{-1} C_{-1})$ are known or can be estimated from the problem, we simply use
them in the evaluation of the classifier. When the costs or priors of a problem are not known,
we choose $(\lambda_1, \lambda_2) = (P_1 C_1, P_{-1} C_{-1})$ to purposely evaluate the classifier at a certain high true
positive rate (TPR) region of the ROC or a certain high true negative rate (TNR) region of the ROC.
Simply choosing one $(\lambda_1, \lambda_2)$ pair for evaluation is not a robust measure and might favor a certain
algorithm, so we evaluate the minimum risk at all points within a certain region of the ROC. This is
similar to finding the t-AUC Wu et al. (2008), which evaluates the area under the ROC curve within
the 1 to $t$ true negative region. We extend this method and propose the TP-t-AUC and TN-t-AUC
to evaluate the area under the ROC curve within the 1 to $t$ true positive and 1 to $t$ true negative
regions respectively. In the experiments we specifically report both the TP-t-AUC and TN-t-AUC
in order to demonstrate the CS-SVM's ability to learn models with both high detection and high
specificity.

8. Experimental study 

In this section we conduct extensive experiments on 21 real world datasets and compare the
BM-SVM, BP-SVM and CS-SVM algorithms. The experiments are grouped into four types, namely
cost-sensitive learning with available class-dependent costs (CSA), cost-sensitive learning when
class-dependent costs are unavailable (CSU), cost-sensitive learning with example-dependent costs
(CSE) and imbalanced dataset learning (IDL). The datasets and experiments are further explained in
the following sections.

8.1 Datasets 

21 datasets, created from 20 distinct datasets, are used to compare the performance of the CS-SVM 
algorithm with other algorithms under different scenarios. Table 1 shows the detailed specifications 
of each dataset. Each dataset is associated with a type of experiment. For example, the KDD98 
dataset is used in the CSE experiment and datasets with large class imbalance ratios are used in IDL 
experiments. For each dataset we choose the class with the higher cost or fewer data points as the 
target or positive class. All multi-class datasets were converted to binary datasets. In particular,
the binary datasets SIAM(1) and SIAM(2) are datasets which have been constructed from the same
multi-class dataset but with different target classes and thus different imbalance ratios.²

8.2 Setup 

The RBF Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ is used for all SVM algorithms. We choose
the hyper parameters $C$ and $\gamma$ by performing a 2D grid search and optimizing the associated
performance measure (risk, TP/TN-t-AUC or income). Given that the sizes of the datasets are very
different, we avoid overfitting by considering a specific search range and granularity for each dataset,
but use the same range and granularity for all algorithms. In each iteration of the grid search, the
performance is evaluated by 10 fold cross-validation for small datasets and on a separate
validation set for large datasets, which appear in bold font in Table 1. Once the 2D grid search is
complete, the hyper parameters are used to train the BM-SVM. Also, the kernel hyper parameter is
used for training both the BP-SVM and the CS-SVM.
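
A minimal sketch of the kernel and of one possible search grid follows. The kernel formula is the one stated above; the power-of-two grid is a common convention and is only an assumption here, since the per-dataset ranges and granularities are not listed in the text.

```python
import math


def rbf_kernel(x, z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(X, gamma):
    """Gram matrix of the RBF kernel over the rows of X."""
    return [[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X]


def log2_grid(low_exp, high_exp, step=1):
    """Hypothetical search grid: powers of two from 2**low_exp to 2**high_exp."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1, step)]


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # illustrative points
K = kernel_matrix(X, gamma=0.5)
C_grid = log2_grid(-2, 2)
gamma_grid = log2_grid(-2, 2)
search_space = [(C, g) for C in C_grid for g in gamma_grid]   # 2D grid over (C, gamma)
```

Each pair in `search_space` would then be scored by cross-validation or on the validation set, as described above.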

Without loss of generality, we set $C_{-1} = 1$ in the BP-SVM experiments. Therefore, when
considering the BP-SVM experiments we only need to perform an additional 2D grid search for $C$ and
$C_1$. The CS-SVM actually has four independent hyper parameters, including $\gamma$. We perform a 3D
grid search on $C$, $C_1$ and $\kappa$ when the costs are not determined, and a 2D search on $C$ and $\kappa$ when the
costs $C_1$ and $C_{-1}$ are available. Note that in the case of available costs (CSA), setting $\kappa$ to a value
other than $\kappa = \frac{1}{2C_{-1} - 1}$ implicitly means that $C_{-1}$ is set to a value other than its determined value.
However, we deliberately allow this in order to make use of the CS-SVM algorithm's asymmetric
margin advantages. Nevertheless, we use the determined cost of $C_{-1}$ during performance
evaluation.³ Finally, we use the TP-0.9-AUC and TN-0.9-AUC performance measures when considering
the IDL and CSU type experiments since the costs are not explicitly known in these experiments.

2. SIAM, Web Spam, IJCNN, MNIST, KDD99 and Covertype data sets were obtained from the LIBSVM data
website: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Table 1: Specifications of the benchmark datasets. # of Ex. is the number of example data points.
# of Feat. is the number of features. Ratio is the class imbalance ratio. Target specifies the
target or positive class. Type specifies the type of experiment conducted on this dataset.

Dataset                        # of Ex.     # of Feat.  Ratio    Target                  Type
German Credit                  1,000        24          1:2      Bad (2)                 CSA
Heart                          270          13          1:1      Presence (2)            CSA
KDD 99 (Intrusion Detection)   [illegible]  42          1:4      Normal                  CSA
KDD 98 (Donation)              191,779      479         1:20     1                       CSE
Breast Cancer Diagnostic       569          32          1:2      Malignant (M)           CSU
Breast Cancer Original         699          10          1:2      Malignant (4)           CSU
Diabetes                       768          8           1:2      Has Diabetes (+1)       CSU
Echo-cardiogram                132          12          1:2      Alive (1)               CSU
Liver                          345          6           1:1      1                       CSU
Sonar                          208          60          1:1      +1                      CSU
Tic-Tac-Toe                    958          9           1:2      Negative                CSU
Web Spam                       350,000      254         1:2      -1                      CSU
Breast Cancer Prognostic       198          34          1:3      Recur (R)               IDL
Covertype                      581,012      54          1:211    Cottonwood/Willow (4)   IDL
Hepatitis                      155          20          1:4      Die (1)                 IDL
IJCNN                          141,691      2           1:10     +1                      IDL
Isolet                         7,797        617         1:25     K (11)                  IDL
MNIST                          70,000       780         1:10     5                       IDL
SIAM1                          28,596       30438       1:2000   1,6,7,11                IDL
SIAM11                         28,596       30438       1:716    11,12                   IDL
Survival                       306          3           1:3      2                       IDL
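
The 3D search over $C$, $C_1$ and $\kappa$ can be sketched as a filtered Cartesian product. Restricting the grid to triples that satisfy $0 < \kappa \leq 1 \leq 1/\kappa \leq C_1$ is an assumption made for this illustration (the text does not say whether the search was filtered this way), as is the relation $\kappa = 1/(2C_{-1} - 1)$ used to recover the implied $C_{-1}$; the candidate values are hypothetical.

```python
from itertools import product


def cs_svm_grid(C_values, C1_values, kappa_values):
    """Yield (C, C1, kappa) triples with 0 < kappa <= 1 <= 1/kappa <= C1."""
    for C, C1, kappa in product(C_values, C1_values, kappa_values):
        if 0.0 < kappa <= 1.0 and 1.0 / kappa <= C1:
            yield C, C1, kappa


def implied_negative_cost(kappa):
    """C_{-1} implied by kappa, ASSUMING kappa = 1 / (2*C_{-1} - 1)."""
    return 0.5 * (1.0 / kappa + 1.0)


# Hypothetical candidate values.
grid = list(cs_svm_grid([1.0], [1.0, 4.0], [1.0, 0.5, 0.25, 0.125]))
```

For $C_1 = 4$ the value $\kappa = 0.125$ is dropped because $1/\kappa = 8$ exceeds $C_1$.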

8.3 Implementation 

The CS-SVM problem (54) is readily implemented in the dual by modifying the LibSVM Chang
and Lin (2011) source code. This is done by 1) adding the regularization term to the LibSVM
objective function and 2) selecting $C_1$ and $\frac{1}{\kappa}$ as the cost parameters of the positive and negative
class, respectively. As a result, $C$, $\gamma$, $C_1$ and $\kappa$ are the CS-SVM solver hyper parameters.
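
Step 2) maps onto the per-class weight options that stock LIBSVM already provides (`-wi weight` sets the parameter C of class i to weight*C for C-SVC). The sketch below only builds such an option string and the resulting box constraints; the helper names are illustrative. Step 1), the extra regularization term, requires changing the solver source and is not covered by these options.

```python
def libsvm_style_options(C, gamma, C1, kappa):
    """Option string for C-SVC with an RBF kernel and class weights C_1 and 1/kappa.
    This reproduces only the box constraints of (54), not the extra l1 term."""
    if not (0.0 < kappa <= 1.0):
        raise ValueError("kappa must lie in (0, 1]")
    return "-s 0 -t 2 -c {} -g {} -w1 {} -w-1 {}".format(C, gamma, C1, 1.0 / kappa)


def per_class_bounds(C, C1, kappa):
    """Upper bounds on the multipliers: C*C_1 for class +1, C/kappa for class -1."""
    return {1: C * C1, -1: C / kappa}


options = libsvm_style_options(C=2.0, gamma=0.5, C1=3.0, kappa=0.25)
bounds = per_class_bounds(C=2.0, C1=3.0, kappa=0.25)
```

With $\kappa = 1$ and $C_1 = 1$ both weights equal one and the options describe a standard C-SVC.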



3. The source code for CS-SVM and all grid searches is available at http://www.svcl.ucsd.edu/projects/costlearning

Table 2: Expected risk of datasets with known class-dependent costs.

Dataset          BM-SVM     BP-SVM     CS-SVM
German Credit    0.238804   0.233108   0.230757
Heart            0.098467   0.101688   0.098467
KDD 99           0.054657   0.054657   0.045242


Table 3: TP-0.9-AUC on datasets with unknown class costs.

Dataset            BM-SVM   BP-SVM   CS-SVM
Breast Cancer D.   0.33     0.19     0.16
Breast Cancer O.   0.03     0.03     0.03
Diabetes           0.36     0.37     0.34
Echo-cardiogram    0.43     0.48     0.35
Liver              0.921    0.921    0.920
Sonar              0.40     0.40     0.38
Tic-Tac-Toe        0.97     0.90     0.88
Web Spam           0.03     0.02     0.01


8.4 Experiments on cost-sensitive learning with known class-dependent costs 

Three datasets with known class costs are examined. Namely, the German credit card dataset Geibel 
et al. (2004); Newman et al. (1998), the Statlog Heart Disease Newman et al. (1998) and KDD99 
Elkan (2000) datasets are considered. The minimum risk using the BM-SVM, BP-SVM and CS- 
SVM is shown in Table 2 for each of the CSA datasets. The CS-SVM algorithm outperforms the 
BP-SVM on all datasets, surpasses the BM-SVM on two and ties with the BM-SVM on one dataset. 

8.5 Experiments on cost-sensitive learning with unknown class-dependent costs 

We consider eight datasets which do not have known costs and are not highly imbalanced. Namely,
we examine the Breast Cancer Diagnostic, Breast Cancer Original, Pima Indian Diabetes,
Echo-cardiogram, Liver, Sonar, Tic-Tac-Toe Newman et al. (1998) and Web Spam Webb et al. (2006)
datasets. The CS-SVM exhibits improved TP-0.9-AUC (Table 3) and TN-0.9-AUC (Table 4)
performance compared to BP-SVM and BM-SVM in 15 out of 16 experiments and ties in one experiment.

8.6 Experiments on imbalanced data learning 

We examine large datasets with severe imbalance ratios to evaluate the merit of the proposed CS- 
SVM algorithm on imbalanced data learning which could be the most prevailing problem in practice. 
The CS-SVM exhibits improved TP-0.9-AUC (Table 3) and TN-0.9-AUC (Table 4) performance 
compared to BP-SVM and BM-SVM in 17 out of 18 IDL experiments and ties in one experiment. 

Table 4: TN-0.9-AUC on datasets with unknown class costs. 

Dataset             BM-SVM    BP-SVM    CS-SVM 
Breast Cancer D.    0.40      0.35      0.31 
Breast Cancer O.    0.17      0.17      0.16 
Diabetes            0.69      0.67      0.66 
Echo-cardiogram     0.60      0.60      0.35 
Liver               0.90      0.95      0.88 
Sonar               0.70      0.62      0.60 
Tic-Tac-Toe         0.93      0.87      0.86 
Web Spam            0.03      0.03      0.02 

Table 5: TP-0.9-AUC on imbalanced datasets. 

Dataset             BM-SVM    BP-SVM    CS-SVM 
Breast Cancer P.    0.83      0.79      0.76      IDL 
Covertype           0.034     0.020     0.016     IDL 
Hepatitis           0.56      0.40      0.36      IDL 
IJCNN               0.091     0.034     0.031     IDL 
Isolet              0.86      0.19      0.10      IDL 
MNIST               0.053     0.019     0.017     IDL 
SIAM1               0.76      0.30      0.29      IDL 
SIAM11              0.70      0.70      0.70      IDL 
Survival            0.89      0.88      0.87      IDL 

8.7 Experiments on cost-sensitive learning with example-dependent cost 

We study example-dependent cost-sensitive learning using the well known KDD98 dataset. This 
dataset contains information about past contributors to charities. The task is to classify individuals 
as donors or non-donors for a new charity so that overall donations are maximized. The cost of 
sending mail and soliciting a donation is $0.68 and the range of possible donations is $1 to $200. 
We use the total profit performance measure Elkan (2001) and evaluate the algorithms according to 
the benefit matrix shown in Table 7. 
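
As a rough illustration of the total profit measure, the following Python sketch sums the benefit 
over all individuals. The function name `total_profit` and the toy data are hypothetical. The sketch 
assumes that mailing an actual donor yields the donation minus the mailing cost; mailing a 
non-donor yields -$0.68 and not mailing yields $0, as in the benefit matrix. 

```python
MAIL_COST = 0.68  # cost of one solicitation, as stated in the text


def total_profit(mailed, donations, mail_cost=MAIL_COST):
    """Sum the benefit over all individuals.

    mailed[i]    -- True if individual i is predicted to be a donor (mail is sent)
    donations[i] -- amount donated by i if solicited, 0.0 for a non-donor
    """
    profit = 0.0
    for send, amount in zip(mailed, donations):
        if send:
            # Assumption: an actual donor yields donation minus mailing cost.
            # A non-donor (amount == 0.0) yields just -mail_cost.
            profit += amount - mail_cost
        # Predicted non-donor: no mail is sent, so the benefit is 0.
    return profit


# Toy data (made up for illustration): two donors, two non-donors.
donations = [0.0, 10.0, 0.0, 25.0]
mail_all = [True, True, True, True]       # mails everyone
mail_some = [False, True, False, True]    # mails only the actual donors

print(round(total_profit(mail_all, donations), 2))   # 32.28
print(round(total_profit(mail_some, donations), 2))  # 33.64
```

In this toy case, mailing everyone already yields a positive profit, which is why a classifier has 
to be selective about non-donors before it improves on that baseline. 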

A range of methods and algorithms have previously been applied to this dataset. Some of the 
most profitable are listed in Table 8 and described here. Wong et al. (2005) 
proposed an ad-hoc algorithm which extracts Focused Association Rules (FAR) for the KDD98 
dataset. The FAR method consists of three successive algorithms for rule generation, model building 
and pruning, and yields the best profit on the KDD98 dataset. The example dependent MetaCost 
(ED-MetaCost) and direct cost-sensitive method (DCSM) are both implemented by Zadrozny and 
Elkan (2001) and differ in the method used for cost and probability estimation. These two algorithms 
are transductive in nature, i.e. they change the labels so that the overall cost is minimized. Res-DIPOL 
and Res-ED-BP-SVM Geibel et al. (2004) are resampling based algorithms equipped with the DIPOL 
and ED-BP-SVM algorithms respectively. For these methods the dataset is resampled according 
to a modified probability distribution. Zadrozny et al. (2003) suggest two types of algorithms for 
cost sensitive learning. The first type directly incorporates the costs into the learning 
algorithm, and the second type consists of black box methods that convert a cost insensitive algorithm 
into a cost sensitive algorithm by resampling the data according to the example costs. The Polynomial 
kernel ED-BP-SVM (P-ED-BP-SVM) directly incorporates the costs into the learning algorithm 
while the proposed black box SVM (BB-CI-SVM) and black box C4.5 (BB-C4.5) are examples of 
the second type proposed in Zadrozny et al. (2003). 
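
The black box idea of resampling according to example costs can be sketched with rejection 
sampling, where each example is kept with probability proportional to its cost. This is only one 
plausible way to resample and is not necessarily the exact procedure behind the BB-CI-SVM and 
BB-C4.5 results. The function name `cost_proportionate_sample` and the toy data are hypothetical. 

```python
import random


def cost_proportionate_sample(examples, costs, rng=None):
    """Keep example i with probability costs[i] / max(costs).

    The returned subset can be passed to any cost-insensitive learner,
    which then sees costly examples more often than cheap ones.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    max_cost = max(costs)
    if max_cost <= 0:
        raise ValueError("at least one cost must be positive")
    kept = []
    for example, cost in zip(examples, costs):
        # random() is uniform on [0, 1), so cost == max_cost is always kept
        # and cost == 0 is never kept.
        if rng.random() < cost / max_cost:
            kept.append(example)
    return kept


# Toy data (made up): (features, label) pairs with per-example costs.
examples = [((0.1, 0.2), 1), ((0.4, 0.1), -1), ((0.9, 0.7), 1)]
costs = [25.0, 0.68, 10.0]
subset = cost_proportionate_sample(examples, costs)
print(len(subset), "of", len(examples), "examples kept")
```

Because the learner itself is untouched, the same wrapper applies to an SVM, C4.5 or any other 
cost-insensitive algorithm. 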

Table 8 also shows results for the example dependent implementations of BM-SVM (ED-BM-SVM), 
BP-SVM (ED-BP-SVM) and CS-SVM (ED-CS-SVM) with Gaussian kernels. The ED-CS-SVM exhibits 
the best performance among all ED-SVM methods. It also ranks fifth among all methods, some of 
which use complicated and compounded schemes. 

Table 6: TN-0.9-AUC on imbalanced datasets. 

Dataset             BM-SVM    BP-SVM    CS-SVM 
Breast Cancer P.    0.87      0.81      0.80      IDL 
Covertype           0.062     0.062     0.060     IDL 
Hepatitis           0.70      0.70      0.67      IDL 
IJCNN               0.02      0.02      0.01      IDL 
Isolet              0.86      0.19      0.10      IDL 
MNIST               0.05      0.02      0.02      IDL 
SIAM1               0.938     0.526     0.525     IDL 
SIAM11              1.000     0.748     0.739     IDL 
Survival            0.66      0.64      0.63      IDL 

Table 7: Benefit matrix for the KDD98 dataset. 

                       Donor        Non-donor 
Predicted Donor                     -$0.68 
Predicted Non-donor                 $0 

9. Conclusion 

In this work, we have extended the recently introduced probability elicitation view of loss function 
design to the cost sensitive classification problem. This extension was applied to the SVM problem, 
so as to produce a cost-sensitive hinge loss function. A cost-sensitive SVM learning algorithm 
was then derived, as the minimizer of the associated risk. Unlike previous SVM algorithms, the 
one now proposed enforces cost sensitivity for both separable and non-separable training data, 
enforcing a larger margin for the preferred class, independent of the choice of slack penalty. It also 
offers guarantees of optimality, namely classifiers that implement the cost-sensitive Bayes decision 
rule and approximate the cost-sensitive Bayes risk. The dual problem of CS-SVM is studied and 
connections between cost-sensitive learning and regularization theory and sensitivity analysis are 
established. Minimum expected cost-sensitive risk is considered as a metric for evaluating the 
performance of binary classifiers in the cost-sensitive and imbalanced data settings. The CS-SVM is 
also readily extended to cost-sensitive learning with example-dependent costs. Empirical evidence 
confirms its superior performance, when compared to previous methods. 

Table 8: Income of different algorithms on the KDD98 dataset. 

Rank  Algorithm           Income     Comments 
1     FAR                 $20,693    Ad-hoc method based on sequence of three algorithms 
2     DCSM                $15,329    Probability and cost estimation to minimize cost 
3     BB-C4.5             $15,016    C4.5 on resampled dataset 
4     KDD-Cup 98 Winner   $14,712    Rule-based approach 
5     ED-CS-SVM           $14,205    ED-CS-SVM with Gaussian kernel n = 0.97 
6     ED-MetaCost         $14,113    Probability and cost estimation to minimize cost 
7     ED-BP-SVM           $14,008    ED-BP-SVM with Gaussian kernel 
8     Res-DIPOL           $14,045    DIPOL on resampled dataset 
9     P-ED-BP-SVM         $13,683    ED-BP-SVM with Polynomial Kernel 
10    BB-SVM              $13,152    CI-SVM on resampled dataset 
11    Res-ED-BP-SVM       $12,883    ED-BP-SVM on resampled dataset 
12    BM-CI-SVM           $10,560    Standard SVM 
13    Null Classifier     $10,560    Predicts all examples as donor 